How Meta trains large language models at scale

Engineering at Meta

Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together. Large-scale model training transfers vast amounts of data between GPUs in a synchronized fashion, so achieving optimal connectivity between GPUs requires revisiting trade-offs made for other types of workloads.
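The synchronized GPU-to-GPU transfer mentioned above is typically a collective such as ring all-reduce, which sums gradient buffers across workers so every worker ends up with the same result. The following is a minimal, pure-Python sketch (lists stand in for GPU buffers; real systems use NCCL or similar), not Meta's implementation:

```python
def ring_all_reduce(grads):
    """Sum the gradient vectors of all workers so each worker ends up
    holding the full reduced vector. Vector length must be divisible
    by the number of workers."""
    n = len(grads)
    chunk = len(grads[0]) // n
    bufs = [list(g) for g in grads]  # work on copies

    def sl(c):
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. After n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        # Snapshot sends first so each step is synchronous.
        sends = [(i, (i - step) % n, bufs[i][sl((i - step) % n)])
                 for i in range(n)]
        for i, c, data in sends:
            dst = (i + 1) % n
            bufs[dst][sl(c)] = [a + b for a, b in zip(bufs[dst][sl(c)], data)]

    # Phase 2: all-gather. Circulate the summed chunks so every worker
    # receives the complete reduced vector.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, bufs[i][sl((i + 1 - step) % n)])
                 for i in range(n)]
        for i, c, data in sends:
            bufs[(i + 1) % n][sl(c)] = data
    return bufs
```

Each worker only ever exchanges data with its ring neighbor, which is why this pattern maps well onto bandwidth-constrained GPU interconnects.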

A RoCE network for distributed AI training at scale

Engineering at Meta

AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training large models with hundreds of billions of parameters, such as Llama 3.1. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.
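To get a feel for that strain, a back-of-envelope estimate (illustrative assumptions, not Meta's published numbers): with ring all-reduce, each GPU moves roughly 2(n-1)/n times the gradient buffer over the network per synchronization.

```python
def all_reduce_bytes_per_gpu(n_params, bytes_per_grad=2, n_gpus=1024):
    """Approximate bytes each GPU sends for one ring all-reduce of the
    full gradient buffer (2 * (n - 1) / n of the buffer per link)."""
    return 2 * (n_gpus - 1) / n_gpus * n_params * bytes_per_grad

# Hypothetical example: 405B parameters, bf16 (2-byte) gradients, 1024 GPUs
# -> roughly 1.6 TB sent per GPU per full-gradient all-reduce.
gb_per_gpu = all_reduce_bytes_per_gpu(405e9, 2, 1024) / 1e9
```

In practice gradients are reduced in sharded, overlapped chunks rather than one giant buffer, but the aggregate volume is why the network becomes the bottleneck at this scale.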


Building Meta’s GenAI Infrastructure

Engineering at Meta

We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. At Meta, we handle hundreds of trillions of AI model executions per day. One cluster uses a RoCE network fabric solution; the other features an NVIDIA Quantum2 InfiniBand fabric.