InfiniBand and Topology - IT Networking Pro Today

How Meta trains large language models at scale

Engineering at Meta

JUNE 12, 2024

There are two leading choices in the industry that fit these requirements: RoCE and InfiniBand fabrics. Both of these options had tradeoffs. Meta had built research clusters with InfiniBand as large as 16K GPUs. So we decided to build both: two 24K clusters, one with RoCE and another with InfiniBand.

Infiniband

Infiniband Data Centers Topology Networking

A RoCE network for distributed AI training at scale

Engineering at Meta

AUGUST 5, 2024

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data center (DC) scale, e.g., by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.
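To make the excerpt's two ideas concrete (a two-stage Clos fabric and a topology-aware scheduler), here is a minimal Python sketch. It is an illustration under stated assumptions, not Meta's design: the class name ClosFabric, its fields, the switch counts, and the greedy placement rule are all hypothetical. It assumes every leaf (rack-level) switch connects to every spine switch, so two GPUs under the same leaf are one switch apart and GPUs under different leaves are three switches apart (leaf, spine, leaf).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ClosFabric:
    """Toy two-stage Clos: every leaf switch links to every spine switch."""

    num_leaves: int     # hypothetical number of rack-level (leaf) switches
    num_spines: int     # hypothetical number of spine switches
    gpus_per_leaf: int  # hypothetical number of GPUs attached to each leaf

    def leaf_of(self, gpu: int) -> int:
        """Return the index of the leaf switch that a GPU is attached to."""
        if not 0 <= gpu < self.num_leaves * self.gpus_per_leaf:
            raise ValueError(f"GPU index {gpu} out of range")
        return gpu // self.gpus_per_leaf

    def switch_hops(self, a: int, b: int) -> int:
        """Switches traversed between two GPUs: 0 (same GPU), 1, or 3."""
        if a == b:
            return 0
        return 1 if self.leaf_of(a) == self.leaf_of(b) else 3

    def place_job(self, free_gpus: list[int], size: int) -> list[int]:
        """Greedy placement that spans as few leaf switches as possible.

        Leaves with the most free GPUs are used first (ties broken by
        leaf index), which keeps more traffic below a single leaf.
        """
        by_leaf: dict[int, list[int]] = defaultdict(list)
        for gpu in sorted(set(free_gpus)):
            by_leaf[self.leaf_of(gpu)].append(gpu)

        placement: list[int] = []
        ranked = sorted(by_leaf.items(), key=lambda kv: (-len(kv[1]), kv[0]))
        for _, gpus in ranked:
            placement.extend(gpus[: size - len(placement)])
            if len(placement) == size:
                return placement
        raise ValueError("not enough free GPUs for this job")


fabric = ClosFabric(num_leaves=4, num_spines=2, gpus_per_leaf=4)
free = [0, 1, 4, 5, 6, 7, 8, 12, 13, 14]
small_job = fabric.place_job(free, 4)  # fits under a single leaf
large_job = fabric.place_job(free, 6)  # must spill onto a second leaf
```

In this toy setup the four-GPU job lands entirely under one leaf, so its traffic never crosses a spine, while the six-GPU job spans two leaves. A production scheduler would weigh many more factors; the sketch only shows why knowing the topology changes placement.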

Networking

Networking Network Topology Data Centers

Building Meta’s GenAI Infrastructure

Engineering at Meta

MARCH 12, 2024

The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

Infiniband

Infiniband Data Centers Server Networking
