Hedge 244: Networks for AI

Rule 11

What are the requirements for running AI workloads over a data center fabric? Why is InfiniBand so popular for building AI networks? Jeff Tantsura joins Tom Ammon and Russ White to discuss networks for AI workloads.

How Meta trains large language models at scale

Engineering at Meta

Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together. Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. … requires revisiting trade-offs made for other types of workloads.

A RoCE network for distributed AI training at scale

Engineering at Meta

AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training and enabling large models with hundreds of billions of parameters, such as LLAMA 3.1. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.

Building Meta’s GenAI Infrastructure

Engineering at Meta

We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads. Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. We use this cluster design for Llama 3 training.

Top Tips for Debugging and Optimizing NVIDIA Networking Performance

Router-switch

In today’s high-speed networking world, optimizing and troubleshooting performance is crucial, especially with high-performance equipment like NVIDIA InfiniBand switches. Whether you’re a data center admin or network engineer, mastering effective techniques is key.