
How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. Server failures occur for several reasons, but this failure mode is seen more in a server's early life and settles as the server ages. Both interconnect options evaluated, RoCE and InfiniBand, had tradeoffs.
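The synchronized transfer pattern described here is collective communication, for example an all-reduce of gradients at every training step. Below is a minimal sketch using PyTorch's torch.distributed; the backend choice, tensor shape, and launch setup are illustrative assumptions, not details from the article:

```python
# Minimal sketch of synchronized GPU-to-GPU data transfer:
# every rank contributes its local gradients and receives the sum.
# Assumes one process per GPU, launched with torchrun; the names
# and sizes below are illustrative, not taken from Meta's setup.
import torch
import torch.distributed as dist

def main():
    # NCCL is the usual backend for GPU collectives over RoCE/InfiniBand.
    dist.init_process_group(backend="nccl")
    rank = dist.get_rank()
    torch.cuda.set_device(rank % torch.cuda.device_count())

    # Stand-in for a shard of model gradients living on this GPU.
    grads = torch.randn(1024, 1024, device="cuda")

    # Synchronized all-reduce: every rank blocks until all ranks
    # have contributed, then holds the element-wise sum.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()  # average the gradients

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```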


Building Meta’s GenAI Infrastructure

Engineering at Meta

The other cluster features an NVIDIA Quantum-2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.


A RoCE network for distributed AI training at scale

Engineering at Meta

The scheduler learns the position of GPU servers in the logical topology and uses it to recommend a rank assignment. A second approach posted each message to a different queue in round-robin fashion, but it produced smaller message sizes on the fabric as well as multiple ACKs. A toy sketch of the topology-aware idea follows.
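One way to picture topology-aware rank assignment: order servers by the switch they hang off, so that adjacent ranks, which talk most in ring-style collectives, tend to share a switch. The topology map and grouping rule here are assumptions for illustration, not Meta's scheduler logic:

```python
# Toy sketch of topology-aware rank assignment: place consecutive
# ranks on servers that share a top-of-rack switch, so neighbor-heavy
# collectives (e.g., ring all-reduce) stay local to the switch.
# The topology map and grouping rule are illustrative assumptions.

# Hypothetical map: server hostname -> switch it hangs off.
topology = {
    "host-a1": "switch-1", "host-a2": "switch-1",
    "host-b1": "switch-2", "host-b2": "switch-2",
}

def assign_ranks(topology):
    """Order servers by (switch, hostname), then assign ranks in that
    order, so adjacent ranks are most likely to share a switch."""
    servers = sorted(topology, key=lambda h: (topology[h], h))
    return {host: rank for rank, host in enumerate(servers)}

ranks = assign_ranks(topology)
# host-a1 -> 0, host-a2 -> 1 (both on switch-1), then the switch-2 hosts.
print(ranks)
```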
