A RoCE network for distributed AI training at scale
Engineering at Meta
AUGUST 5, 2024
This backend fabric uses the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets so that it can be carried, and routed, over an IP network. Our GPU clusters did not start out this way: initially they used a simple star topology, with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol.
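To make the difference between the two protocol versions concrete, here is a minimal Python sketch of the encapsulation. It relies on two well-known constants: RoCEv1 frames are identified by EtherType 0x8915 directly on Ethernet, while RoCEv2 traffic is identified by UDP destination port 4791. The function names `udp_encapsulate` and `classify`, and the zero-filled placeholder standing in for an InfiniBand transport packet, are assumptions made for illustration and are not taken from the fabric described here.

```python
import struct
from typing import Optional

ROCE_V1_ETHERTYPE = 0x8915  # RoCEv1: InfiniBand transport directly over Ethernet (Layer 2 only)
ROCE_V2_UDP_DPORT = 4791    # RoCEv2: IANA-assigned UDP destination port
ETHERTYPE_IPV4 = 0x0800
IPPROTO_UDP = 17


def udp_encapsulate(ib_packet: bytes, src_port: int) -> bytes:
    """Prefix an InfiniBand transport packet with a UDP header, RoCEv2 style.

    Hypothetical helper: a real stack also adds IP and Ethernet headers
    and an invariant CRC, which are omitted here.
    """
    length = 8 + len(ib_packet)  # UDP length covers the 8-byte header plus payload
    checksum = 0                 # 0 means "no checksum" for UDP over IPv4
    header = struct.pack("!HHHH", src_port, ROCE_V2_UDP_DPORT, length, checksum)
    return header + ib_packet


def classify(ethertype: int,
             ip_proto: Optional[int] = None,
             udp_dport: Optional[int] = None) -> str:
    """Tell RoCEv1 from RoCEv2 using only header fields."""
    if ethertype == ROCE_V1_ETHERTYPE:
        return "RoCEv1"
    if (ethertype == ETHERTYPE_IPV4 and ip_proto == IPPROTO_UDP
            and udp_dport == ROCE_V2_UDP_DPORT):
        return "RoCEv2"
    return "other"


if __name__ == "__main__":
    placeholder_ib_packet = bytes(12)  # stand-in for a 12-byte base transport header
    segment = udp_encapsulate(placeholder_ib_packet, src_port=49152)
    print(len(segment), classify(ETHERTYPE_IPV4, IPPROTO_UDP, ROCE_V2_UDP_DPORT))
```

Because the RoCEv2 payload sits inside an ordinary UDP/IP packet, standard Layer 3 switches can route it; a RoCEv1 frame carries no IP header and therefore stays within a single Ethernet broadcast domain.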