
How Meta trains large language models at scale

Engineering at Meta

This means we need to regularly checkpoint our training state and efficiently store and retrieve training data. Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. This requires revisiting trade-offs made for other types of workloads.
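
To make the checkpointing point concrete, here is a minimal sketch of periodic training-state checkpointing in a plain PyTorch loop. The directory name, cadence, and toy model are illustrative assumptions, not details from Meta's post; production systems write to distributed storage and checkpoint asynchronously.

    # Minimal checkpointing sketch (assumptions: local disk, toy model).
    import os
    import torch
    import torch.nn as nn

    CKPT_DIR = "checkpoints"  # hypothetical path; real systems use distributed storage
    os.makedirs(CKPT_DIR, exist_ok=True)

    model = nn.Linear(1024, 1024)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    def save_checkpoint(step: int) -> None:
        # Persist everything needed to resume after a failure:
        # model weights, optimizer state, and the step counter itself.
        torch.save(
            {
                "step": step,
                "model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
            },
            os.path.join(CKPT_DIR, f"step_{step:08d}.pt"),
        )

    save_every = 1000  # cadence trades checkpoint I/O cost against work lost on failure
    for step in range(5_000):
        optimizer.zero_grad()
        loss = model(torch.randn(32, 1024)).pow(2).mean()
        loss.backward()
        optimizer.step()
        if step % save_every == 0:
            save_checkpoint(step)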


Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center-scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.


A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.
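
The strain comes from collective communication: in data-parallel training, every rank all-reduces its gradients on every step, so the fabric must sustain that traffic continuously. The sketch below reproduces this pattern with torch.distributed over NCCL, which can run over RoCE or InfiniBand; the tensor size, step count, and launch command are illustrative assumptions, not Meta's configuration.

    # All-reduce traffic sketch. Launch across GPUs with, e.g.:
    #   torchrun --nproc_per_node=8 allreduce_sketch.py
    import os
    import torch
    import torch.distributed as dist

    def main() -> None:
        # NCCL selects the RDMA transport (RoCE/InfiniBand) when available.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
        torch.cuda.set_device(local_rank)

        # Stand-in for a gradient bucket; sized here at ~100 MB of fp32.
        grads = torch.randn(25_000_000, device="cuda")

        for _ in range(10):
            # Every rank exchanges the full bucket each step, so the
            # network must carry this bandwidth for the whole training run.
            dist.all_reduce(grads, op=dist.ReduceOp.SUM)
            grads /= dist.get_world_size()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()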
