
How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. Solving this problem requires a robust, high-speed network infrastructure as well as efficient data-transfer protocols and algorithms. Both candidate fabrics, RoCE over Ethernet and InfiniBand, had tradeoffs.
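The synchronized GPU-to-GPU transfers described above are typically implemented with collective operations such as ring all-reduce. The sketch below is a minimal plain-Python simulation (ranks as lists, no real GPUs or network), assuming the gradient length is divisible by the number of ranks; it shows the two phases, reduce-scatter followed by all-gather, in which each rank exchanges one chunk per step with its ring neighbor:

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: grads is a list of per-rank gradient
    lists of equal length; returns per-rank lists that all hold the
    element-wise sum. Assumes len(grads[0]) is divisible by len(grads)."""
    n = len(grads)                       # number of ranks in the ring
    chunks = [list(g) for g in grads]    # work on copies
    k = len(grads[0]) // n               # elements per chunk

    def sl(c):
        return slice(c * k, (c + 1) * k)

    # Reduce-scatter: after n-1 steps, rank r holds the full sum
    # of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][sl((r - step) % n)])
                 for r in range(n)]      # snapshot before applying
        for r, c, data in sends:
            dst = (r + 1) % n
            chunks[dst][sl(c)] = [a + b
                                  for a, b in zip(chunks[dst][sl(c)], data)]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][sl((r + 1 - step) % n)])
                 for r in range(n)]
        for r, c, data in sends:
            chunks[(r + 1) % n][sl(c)] = data

    return chunks

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# → [[12, 15, 18], [12, 15, 18], [12, 15, 18]]
```

Each rank sends and receives only 2*(n-1) chunks regardless of ring size, which is why ring all-reduce is bandwidth-optimal and why the underlying fabric, rather than the algorithm, becomes the bottleneck at scale.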


A RoCE network for distributed AI training at scale

Engineering at Meta

This backend fabric utilizes the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol.
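The RoCEv2 encapsulation can be illustrated by packing the headers it adds: the RDMA payload is prefixed with a 12-byte InfiniBand Base Transport Header (BTH) and carried in a UDP datagram addressed to the IANA-assigned destination port 4791 (the outer Ethernet and IP headers are omitted here). This is a layout sketch, not a packet generator; the opcode and field values below are illustrative:

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def rocev2_headers(udp_src_port, payload_len, dest_qp, psn):
    """Build the UDP header plus 12-byte InfiniBand BTH that RoCEv2
    places in front of an RDMA payload (Ethernet/IP layers omitted)."""
    # UDP header: source port, destination port, UDP length, checksum
    # (checksum 0 = unused in this sketch)
    udp = struct.pack("!HHHH", udp_src_port, ROCEV2_UDP_DPORT,
                      8 + 12 + payload_len, 0)
    # BTH: opcode, SE/M/PadCnt/TVer byte, partition key,
    #      reserved + 24-bit destination queue pair,
    #      ack-request/reserved + 24-bit packet sequence number
    opcode = 0x04  # illustrative: InfiniBand RC SEND-only
    bth = struct.pack("!BBHII", opcode, 0, 0xFFFF,
                      dest_qp & 0xFFFFFF, psn & 0xFFFFFF)
    return udp + bth

pkt = rocev2_headers(udp_src_port=49152, payload_len=1024, dest_qp=0x2A, psn=1)
assert len(pkt) == 20  # 8-byte UDP header + 12-byte BTH
```

Because the RDMA semantics travel inside an ordinary UDP datagram, RoCEv2 packets are routable across IP networks, unlike RoCEv1, which rides directly on Ethernet frames and is confined to a single Layer 2 domain.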
