
How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion, so a slow exchange between even a subset of GPUs can compound and slow down the whole job. Two fabrics in the industry meet these requirements: RoCE and InfiniBand.
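
A minimal sketch of that synchronized exchange, using PyTorch's NCCL backend (NCCL carries the traffic over RDMA on either RoCE or InfiniBand). This is our illustration of the failure mode, not Meta's training code: the all-reduce is a barrier in disguise, completing only when the slowest rank's data has arrived.

```python
# Launch with: torchrun --nproc_per_node=8 allreduce_sketch.py
import os

import torch
import torch.distributed as dist


def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    # Stand-in for one bucket of gradients (256 MB of fp32).
    grads = torch.randn(64 * 1024 * 1024, device="cuda")

    # Every rank blocks here until the collective finishes cluster-wide,
    # so a single slow link between any two GPUs stalls the whole step.
    dist.all_reduce(grads, op=dist.ReduceOp.SUM)
    grads /= dist.get_world_size()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```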


A RoCE network for distributed AI training at scale

Engineering at Meta

We ensure there is enough ingress bandwidth on the rack switch so that it does not hinder the training workload. The backend (BE) network is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high-bandwidth, low-latency, lossless transport between any two GPUs in the cluster, regardless of their physical location.
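
To make that concrete, here is a hedged sketch of steering NCCL onto an RDMA backend fabric like this one. The environment variable names are real NCCL settings, but the interface and device values (eth0, mlx5_*) are placeholders; Meta's actual configuration is not published, and the lossless behavior itself (PFC/ECN) is configured on the switches, not in the job.

```python
import os

# Bootstrap/control-plane interface for NCCL's initial rendezvous
# (placeholder name).
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")
# RDMA NICs the collectives should use (placeholder device names).
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0,mlx5_1")
# GID index selecting RoCE v2 addressing on the NIC.
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")
# Traffic class mapped to the fabric's lossless queue (value is illustrative).
os.environ.setdefault("NCCL_IB_TC", "106")

# These must be set before the first NCCL communicator is created,
# i.e. before torch.distributed.init_process_group(backend="nccl").
```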


Building Meta’s GenAI Infrastructure

Engineering at Meta

The other cluster features an NVIDIA Quantum-2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.
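
One reason the same workload can move between the two clusters: the training script only asks for the "nccl" backend, and NCCL discovers the RDMA transport underneath it (InfiniBand verbs on one cluster, RoCE on the other). The DDP wrapper below is the standard PyTorch pattern, shown as a sketch rather than Meta's internal stack.

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def setup_model(model: torch.nn.Module) -> DDP:
    # Fabric-agnostic entry point: NCCL picks IB or RoCE at init time.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    # DDP overlaps gradient all-reduce with the backward pass; the
    # collectives ride whichever RDMA fabric NCCL found.
    return DDP(model.cuda(), device_ids=[local_rank])
```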