
A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, places the heaviest strain on data center networking infrastructure. Building a reliable, high-performance network that can keep up with this growing demand requires rethinking data center network design.


Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.


Trending Sources


How Meta trains large language models at scale

Engineering at Meta

This means we need to regularly checkpoint our training state and efficiently store and retrieve training data. Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. This requires revisiting trade-offs made for other types of workloads.
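The checkpointing pattern mentioned above can be sketched in plain Python: periodically persist the full training state so a run can resume after a failure. This is a minimal illustration, not Meta's actual system; the state layout, interval, and `save_checkpoint`/`train` names are assumptions for the example.

```python
import os
import pickle
import tempfile

def save_checkpoint(state, path):
    """Persist training state atomically: write a temp file, then rename,
    so a crash mid-write never leaves a corrupt checkpoint behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)  # atomic rename on the same filesystem

def train(num_steps, checkpoint_every, path):
    """Toy training loop that resumes from the last checkpoint if present."""
    state = {"step": 0, "weights": [0.0]}  # stand-in for model/optimizer state
    if os.path.exists(path):
        with open(path, "rb") as f:
            state = pickle.load(f)
    while state["step"] < num_steps:
        state["step"] += 1
        state["weights"][0] += 0.1  # stand-in for a real optimizer update
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    return state
```

A real system would checkpoint sharded GPU state to distributed storage rather than pickling to local disk, but the resume-if-present structure is the same.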


Egnyte Architecture: Lessons learned in building and scaling a multi petabyte content platform

High Scalability

The Egnyte Connect platform employs three data centers to fulfill requests from millions of users across the world. To add elasticity, reliability, and durability, these data centers are connected to Google Cloud Platform using the high-speed, secure Google Interconnect network. On-prem data processing. Data interdependence.