Remove Infiniband Remove Protocol Remove UDP port
article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

This backend fabric utilizes the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol.

Network 132