Remove Bandwidth Remove Infiniband Remove Protocol
article thumbnail

How Meta trains large language models at scale

Engineering at Meta

Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms. This requires robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms. This has encompassed developments in a wide range of areas.

article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

We ensure that there is enough ingress bandwidth on the rack switch to not hinder the training workload. The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location.

Network 124