Remove Data Centers Remove Infiniband Remove Protocol
article thumbnail

How Meta trains large language models at scale

Engineering at Meta

A slow data exchange between a subset of GPUs can compound and slow down the whole job. Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms. Network Large-scale model training involves transferring vast amounts of data quickly between GPUs.

article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

Network 132