
A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data center (DC) scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.
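To make the idea of topology-aware scheduling concrete, here is a minimal sketch in Python. It is not the scheduler described in the article, which gives no algorithmic detail. It assumes a simple greedy policy: keep a job inside one AI Zone when possible, and otherwise spread it over as few zones as possible. The names `AIZone`, `free_racks`, and `place_job` are hypothetical and chosen only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class AIZone:
    """Hypothetical model of one AI Zone (a two-stage Clos pod of AI racks)."""
    name: str
    free_racks: list[str] = field(default_factory=list)


def place_job(zones: list[AIZone], racks_needed: int) -> dict[str, list[str]]:
    """Assumed greedy policy, not the article's actual scheduler.

    Returns a mapping of zone name to the racks chosen in that zone.
    The input zones are not modified.
    """
    total_free = sum(len(z.free_racks) for z in zones)
    if racks_needed > total_free:
        raise ValueError("not enough free racks for this job")

    # Case 1: the job fits in a single zone, so traffic never leaves that zone.
    # Pick the tightest fit to keep larger zones free for larger jobs.
    fitting = [z for z in zones if len(z.free_racks) >= racks_needed]
    if fitting:
        best = min(fitting, key=lambda z: len(z.free_racks))
        return {best.name: best.free_racks[:racks_needed]}

    # Case 2: the job must span zones. Taking the zones with the most free
    # racks first minimizes how many zones the job touches.
    placement: dict[str, list[str]] = {}
    remaining = racks_needed
    for zone in sorted(zones, key=lambda z: len(z.free_racks), reverse=True):
        if remaining == 0:
            break
        taken = zone.free_racks[:remaining]
        if taken:
            placement[zone.name] = taken
            remaining -= len(taken)
    return placement


zones = [
    AIZone("zone-a", ["a1", "a2"]),
    AIZone("zone-b", ["b1", "b2", "b3"]),
]
small_job = place_job(zones, 2)   # fits entirely in zone-a
large_job = place_job(zones, 4)   # must span zone-b and zone-a
```

In this sketch a two-rack job lands entirely in the smaller zone, while a four-rack job spills across both zones. A production scheduler would also have to weigh factors this example ignores, such as cross-zone link capacity and the communication pattern of the job.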
