Remove Infiniband Remove Network Remove Protocol
article thumbnail

How Meta trains large language models at scale

Engineering at Meta

Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together. Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms. requires revisiting trade-offs made for other types of workloads.

article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

AI networks play an important role in interconnecting tens of thousands of GPUs together, forming the foundational infrastructure for training, enabling large models with hundreds of billions of parameters such as LLAMA 3.1 Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.

Network 132