Remove Ethernet Remove Infiniband Remove Topology
article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

Our paper, “ RDMA over Ethernet for Distributed AI Training at Meta Scale ,” provides the details on how we design, implement, and operate one of the world’s largest AI networks at scale. We opted for RDMA Over Converged Ethernet version 2 (RoCEv2) as the inter-node communication transport for the majority of our AI capacity.

Network 132
article thumbnail

Building Meta’s GenAI Infrastructure

Engineering at Meta

With this in mind, we built one cluster with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches. The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Both of these solutions interconnect 400 Gbps endpoints.