How Meta trains large language models at scale

Engineering at Meta

Solving this problem requires a robust, high-speed network infrastructure as well as efficient data transfer protocols and algorithms. There are several reasons for this failure, but it is seen more often early in a server's life and subsides as the server ages. This has encompassed developments in a wide range of areas.

A RoCE network for distributed AI training at scale

Engineering at Meta

This backend fabric utilizes the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol.
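
To make the encapsulation concrete, here is a minimal sketch of how an RDMA message could be wrapped in a UDP datagram for RoCEv2. It is not taken from the article: the function names are hypothetical, the invariant CRC (ICRC) is left as a zeroed placeholder rather than computed, and the IP and Ethernet layers are omitted. The only protocol facts it relies on are the 12-byte InfiniBand Base Transport Header (BTH) layout and UDP destination port 4791, which is assigned to RoCEv2.

```python
import struct

ROCEV2_UDP_PORT = 4791  # UDP destination port assigned to RoCEv2


def build_bth(opcode, dest_qp, psn, pkey=0xFFFF, ack_req=False):
    """Pack a 12-byte InfiniBand Base Transport Header (BTH).

    Hypothetical helper: SE, M, PadCnt and TVer are all left at zero.
    """
    flags = 0
    qp_word = dest_qp & 0xFFFFFF  # upper 8 bits are reserved
    psn_word = (int(ack_req) << 31) | (psn & 0xFFFFFF)
    return struct.pack("!BBHII", opcode, flags, pkey, qp_word, psn_word)


def build_udp_header(src_port, payload_len):
    """Pack an 8-byte UDP header; checksum 0 means 'not computed' (IPv4)."""
    return struct.pack("!HHHH", src_port, ROCEV2_UDP_PORT, 8 + payload_len, 0)


def encapsulate(src_port, bth, payload):
    """Return UDP header + BTH + payload + 4-byte ICRC placeholder."""
    icrc_placeholder = b"\x00" * 4  # assumption: a real stack computes a CRC here
    body = bth + payload + icrc_placeholder
    return build_udp_header(src_port, len(body)) + body


# Example: opcode 0x04 is RC SEND Only in the InfiniBand opcode table.
packet = encapsulate(49152, build_bth(0x04, dest_qp=0x12, psn=7), b"hello")
```

Because the RDMA headers ride inside an ordinary UDP/IP datagram, the packet can be routed across subnets, which is what distinguishes RoCEv2 from the non-routable RoCEv1. In practice the UDP source port is often varied per flow so that switches can spread traffic across equal-cost paths.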
