A RoCE network for distributed AI training at scale

Engineering at Meta

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data center (DC) scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Certification Internet service via iPerf3

Network Engineering

Occasionally, customers report issues such as high latency or not achieving their subscribed bandwidth. To address these concerns, we certify the last-mile connection using iPerf3 for traffic and bandwidth analysis. Attached is a topology diagram illustrating the proposed setup.
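As a rough illustration of the certification check described above, the sketch below compares a measured throughput against the subscribed rate. It assumes the test was run as a TCP test with iPerf3's JSON output (`iperf3 -J`), where the receiver-side rate appears under `end.sum_received.bits_per_second`. The function name `certify_throughput`, the 90% pass threshold, and the sample report are hypothetical and not taken from the article.

```python
import json


def certify_throughput(report: dict, subscribed_mbps: float, tolerance: float = 0.9) -> dict:
    """Compare the receiver-side rate in an iPerf3 JSON report with the subscribed rate.

    `tolerance` is an assumed pass threshold (fraction of the subscribed rate),
    not a value from the article.
    """
    # TCP reports from `iperf3 -J` carry the received rate here, in bits per second.
    received_bps = report["end"]["sum_received"]["bits_per_second"]
    measured_mbps = received_bps / 1_000_000
    return {
        "measured_mbps": measured_mbps,
        "subscribed_mbps": subscribed_mbps,
        "passed": measured_mbps >= tolerance * subscribed_mbps,
    }


# Hand-written fragment standing in for real `iperf3 -J` output (hypothetical values).
sample = json.loads('{"end": {"sum_received": {"bits_per_second": 940000000.0}}}')
result = certify_throughput(sample, subscribed_mbps=1000)
print(result)
```

In practice the report would come from running `iperf3 -c <server> -J` against a test server on the far side of the last-mile link, and the threshold would follow whatever the service agreement specifies.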

How Meta trains large language models at scale

Engineering at Meta

We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive. Our intent was to build and learn from the operational experience.

Using Chakra execution traces for benchmarking and network performance optimization

Engineering at Meta

Such predictions become even more complex when the compute engines aren’t ready or when changes in network topology and bandwidth become necessary. As a result, traces sourced from one system might not accurately simulate on another with a different GPU, network topology, and bandwidth.

Why latency is the new outage

Kentik

Not as difficult as time travel, but difficult enough that for 30+ years IT professionals have tried to skirt the issue by adding more bandwidth between locations or by rolling out faster routers and switches. Over the last few decades, network managers have focused on adding bandwidth and reducing network outages.

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

Yes, there’s something to say about how applications are written, but on the public internet side, we’ve seen a decrease in latency and cost, and a massive increase in available bandwidth. This coincided with the advent of public clouds such as AWS, Azure, and GCP. Yes, of course, I’m oversimplifying here. I know there are always exceptions.

Practical Steps for Enhancing Reliability in Cloud Networks - Part I

Kentik

By collecting and analyzing network telemetry, including traffic flows, bandwidth usage, packet loss rates, and error rates, NetOps leverage monitoring to detect and diagnose potential bottlenecks, security threats, and other issues that can impact network reliability, often before end users even notice a problem.
