How Meta trains large language models at scale

Engineering at Meta

There are several reasons for this failure, but this failure mode is seen more in the early life and settles as the server ages. HW network cable: In the general category of unreachable servers, these failures are also seen most often in the early life of the server. Both of these options had tradeoffs.

Building Meta’s GenAI Infrastructure

Engineering at Meta

The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

A RoCE network for distributed AI training at scale

Engineering at Meta

We ensure that there is enough ingress bandwidth on the rack switch to not hinder the training workload. The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location.

Top Tips for Debugging and Optimizing NVIDIA Networking Performance

Router-switch

In today’s high-speed networking world, optimizing and troubleshooting performance is crucial, especially with high-performance equipment like NVIDIA InfiniBand switches. In this blog post, we’ll share top tips for debugging and optimizing NVIDIA InfiniBand networking performance.