A RoCE network for distributed AI training at scale

Engineering at Meta

AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training large models with hundreds of billions of parameters, such as LLAMA 3.1. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.

Watch Meta’s engineers discuss optimizing large-scale networks

Engineering at Meta

Managing network solutions at a growing scale inherently brings challenges around performance, deployment, and operational complexity. Meta's engineers present key ideas underpinning the FBOSS model that helped them build a stable and scalable network.

Trending Sources

Network Speed vs. Bandwidth vs. Throughput: Understanding Network Performance Metrics

Obkio

Learn about the differences between network speed, bandwidth & throughput. Find out why your business should measure them and how!

Network Bandwidth vs. Capacity: What’s Slowing Down Your Network?

Obkio

Discover the key differences between network bandwidth and capacity, and how they impact your network performance. Learn how to monitor & measure them.

Best 25 Network Bandwidth Monitoring Tools of 2025 (Home, Free & Professionals)

Obkio

Take control of your network bandwidth with this comprehensive list of the best 25 network bandwidth monitoring tools of 2025 for home users & IT pros!

N4N002: Bandwidth and Latency Explained

Packet Pushers

In this episode of N Is For Networking, co-hosts Ethan Banks and Holly Metlitzky take a question from college student Douglas that turns into a ride on the networking highway as they navigate the lanes of bandwidth and latency.

HN748: How AI and HPC Are Changing Data Center Networks

Packet Pushers

On today's episode of Heavy Networking, Rob Sherwood joins us to discuss the impact that High Performance Computing (HPC) and artificial intelligence computing are having on data center network design. There's also power and cooling issues, massive bandwidth requirements, and changes in how we. That's the boring part.