article thumbnail

How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. There are two leading choices in the industry that fit these requirements: RoCE and InfiniBand fabrics. A slow data exchange between a subset of GPUs can compound and slow down the whole job.

article thumbnail

Building Meta’s GenAI Infrastructure

Engineering at Meta

The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

The second approach involved posting each message to a different queue, in a round-robin fashion. The first involved splitting each message meant to be posted over a single QP, instead onto multiple QPs resulting in multiple flows. But it also produced smaller message sizes on fabric as well as multiple ACKs.

Network 132