How Meta trains large language models at scale

Engineering at Meta

There are several reasons for this failure, but this failure mode is seen more often early in a server's life and settles as the server ages. Hardware network cable: In the general category of unreachable servers, these failures are also seen most often in the early life of the server. Both of these options had tradeoffs.

Building Meta’s GenAI Infrastructure

Engineering at Meta

The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

A RoCE network for distributed AI training at scale

Engineering at Meta

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data-center (DC) scale, e.g., by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.
