A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network towards the DC-scale, e.g., incorporating topology-awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Using Chakra execution traces for benchmarking and network performance optimization

Engineering at Meta

However, traditional full workload benchmarking presents several challenges. Difficulty in forecasting future system performance: When designing an AI system, engineers frequently face the challenge of predicting the performance of future systems. Our visualization tool can precisely highlight these imbalances, as shown in the figure below.

Trending Sources

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

Yes, there’s something to say about how applications are written, but on the public internet side, we’ve seen a decrease in latency and cost, and a massive increase in available bandwidth. So what does this mean for today’s enterprise network engineer? This coincided with the advent of public clouds such as AWS, Azure, and GCP.

Practical Steps for Enhancing Reliability in Cloud Networks - Part I

Kentik

More than anything, reliability becomes the principal challenge for network engineers working in and with the cloud. Even the most detailed reliability engineering can be easily undermined in an insecure network. While there is much to be said about cloud costs and performance, I want to focus this article primarily on reliability.

How Meta trains large language models at scale

Engineering at Meta

We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive.

Building Meta’s GenAI Infrastructure

Engineering at Meta

Among other benefits, Hammerspace enables engineers to perform interactive debugging for jobs using thousands of GPUs, as code changes are immediately accessible to all nodes within the environment.

Engineering dependability and fault tolerance in a distributed system

High Scalability

This means a system that is not merely available but is also engineered with extensive redundant measures to continue to work as its users expect. … reliability situations, where continuity of service is essential, with redundant elements continuously in service, such as with airplane engines. This ensures reliability.