Hedge 244: Networks for AI

Rule 11

What are the requirements for running AI workloads over a data center fabric? Why is InfiniBand so popular for building AI networks? What about Ethernet for AI?

A RoCE network for distributed AI training at scale

Engineering at Meta

Our paper, “RDMA over Ethernet for Distributed AI Training at Meta Scale,” provides the details on how we design, implement, and operate one of the world’s largest AI networks at scale. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.

Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.