
A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, places the heaviest strain on data center networking infrastructure. Building a reliable, high-performance network that can keep pace with this demand requires rethinking data center network design.
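To make that synchronization pressure concrete, here is a minimal PyTorch sketch of the kind of collective operation (an all-reduce of gradients) that dominates network traffic in synchronous data-parallel training. This is an illustrative example rather than Meta's production stack, and it assumes a cluster where NCCL is configured to run its collectives over the RDMA fabric (RoCE):

    import os
    import torch
    import torch.distributed as dist

    def main() -> None:
        # One process per GPU; a launcher such as torchrun sets
        # RANK, WORLD_SIZE, and LOCAL_RANK in the environment.
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

        # Stand-in for a gradient tensor produced by backward().
        grad = torch.full((1024,), float(dist.get_rank()), device="cuda")

        # Every rank blocks until all ranks contribute, so a single
        # slow or lossy link stalls the entire training step -- this
        # is the strain on the network fabric described above.
        dist.all_reduce(grad, op=dist.ReduceOp.SUM)
        grad /= dist.get_world_size()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Launched with, for example, torchrun --nproc_per_node=8 on each host; whether the collective actually traverses RoCE depends on the NICs and the NCCL configuration.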


Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta currently operates many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, designed to support the scaling demands of compute and storage. This transition to supporting large-scale AI workloads has not been without its challenges.


How Meta trains large language models at scale

Engineering at Meta

This means we need to regularly checkpoint our training state and efficiently store and retrieve training data. Optimal connectivity between GPUs: large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. This requires revisiting trade-offs made for other types of workloads.
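As an illustration of the checkpointing requirement, a minimal sketch of saving and restoring training state in PyTorch follows. The helper names are hypothetical; Meta's production checkpointing system is far more elaborate (sharded, asynchronous, and fault-tolerant):

    import torch

    # Hypothetical helpers illustrating periodic training-state checkpoints.

    def save_checkpoint(model, optimizer, step, path):
        # Persist everything needed to resume: model weights,
        # optimizer state (e.g. Adam moments), and the step counter.
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            path,
        )

    def load_checkpoint(model, optimizer, path):
        # Restore onto CPU first so loading does not require the
        # same GPU layout the checkpoint was written from.
        state = torch.load(path, map_location="cpu")
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        return state["step"]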


Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data-center-scale cluster at Meta. Custom-designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.