A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward DC scale, e.g., by incorporating topology-awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.
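
As an illustration of what topology-aware scheduling can mean in practice, here is a minimal sketch that prefers to place an entire job inside a single AI Zone (one two-stage Clos unit) and spills across zones only when it must. The `AIZone` and `schedule_job` names are hypothetical, not Meta's actual scheduler API.

```python
from dataclasses import dataclass

@dataclass
class AIZone:
    """One two-stage Clos unit of AI racks (hypothetical model)."""
    name: str
    free_gpus: int

def schedule_job(zones: list[AIZone], gpus_needed: int) -> list[tuple[str, int]]:
    """Return (zone_name, gpu_count) assignments for one training job."""
    # Best case: the whole job fits in one zone, so collective traffic
    # never leaves a single Clos fabric. Pick the smallest zone that fits.
    for zone in sorted(zones, key=lambda z: z.free_gpus):
        if zone.free_gpus >= gpus_needed:
            zone.free_gpus -= gpus_needed
            return [(zone.name, gpus_needed)]
    # Otherwise span zones, taking from the zones with the most free GPUs
    # first to minimize how many zones (and cross-zone hops) the job uses.
    assignments: list[tuple[str, int]] = []
    for zone in sorted(zones, key=lambda z: z.free_gpus, reverse=True):
        take = min(zone.free_gpus, gpus_needed)
        if take > 0:
            zone.free_gpus -= take
            gpus_needed -= take
            assignments.append((zone.name, take))
        if gpus_needed == 0:
            return assignments
    raise RuntimeError("insufficient free GPUs across all zones")

zones = [AIZone("zone-a", 512), AIZone("zone-b", 2048)]
print(schedule_job(zones, 1024))  # -> [('zone-b', 1024)]: fits in one zone
```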

How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. We optimized the RoCE cluster for quick build time and the InfiniBand cluster for full-bisection bandwidth. Our intent was to build both and learn from their operational experience.
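
The "synchronized fashion" here is the collective-communication pattern at the heart of data-parallel training. Below is a minimal sketch of that pattern using PyTorch's NCCL backend, which runs over both RoCE and InfiniBand transports; the script and its launch command are illustrative, not taken from the post.

```python
import os

import torch
import torch.distributed as dist

def main() -> None:
    # NCCL selects the RDMA transport (RoCE or InfiniBand) underneath.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))  # set by torchrun
    rank, world = dist.get_rank(), dist.get_world_size()

    # Every rank contributes a gradient-sized tensor; all_reduce sums them
    # across all GPUs in lockstep -- the synchronized bulk transfer that
    # full-bisection bandwidth is meant to serve.
    grad = torch.full((1024,), float(rank), device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)

    if rank == 0:
        expected = float(sum(range(world)))
        assert torch.allclose(grad, torch.full_like(grad, expected))
        print(f"all_reduce across {world} ranks OK")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()  # e.g.: torchrun --nproc_per_node=8 allreduce_demo.py
```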

Building Meta’s GenAI Infrastructure

Engineering at Meta

This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading. This helped push our large clusters to achieve great, predictable performance, just as our small clusters do.
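
The post doesn't show the storage API, but the save/load pattern it describes (thousands of ranks writing shards at the same step, then resuming together) can be sketched as follows. The shared-mount path and helper names are assumptions for illustration, not Meta's actual storage stack.

```python
import os

import torch
import torch.distributed as dist

CHECKPOINT_DIR = "/mnt/shared/checkpoints"  # hypothetical shared mount

def save_checkpoint(model: torch.nn.Module, step: int) -> None:
    """Each rank writes only its local shard, then all ranks sync."""
    path = os.path.join(CHECKPOINT_DIR, f"step_{step}")
    os.makedirs(path, exist_ok=True)
    # Thousands of GPUs writing at once is the bursty load the excerpt
    # calls a challenge for any storage solution.
    torch.save(model.state_dict(),
               os.path.join(path, f"rank_{dist.get_rank()}.pt"))
    dist.barrier()  # no rank trains ahead before every shard is written

def load_checkpoint(model: torch.nn.Module, step: int) -> None:
    """Each rank reads back its own shard, then all ranks sync."""
    shard = os.path.join(CHECKPOINT_DIR, f"step_{step}",
                         f"rank_{dist.get_rank()}.pt")
    model.load_state_dict(torch.load(shard, map_location="cuda"))
    dist.barrier()
```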