A RoCE network for distributed AI training at scale

Engineering at Meta

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data center (DC) scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

How Meta trains large language models at scale

Engineering at Meta

This failure has several causes, but it is seen more often early in a server's life and settles as the server ages. HW network cable: In the general category of unreachable servers, these failures are also seen most often early in the life of the server.

Building Meta’s GenAI Infrastructure

Engineering at Meta

Among other benefits, Hammerspace enables engineers to perform interactive debugging for jobs using thousands of GPUs, because code changes are immediately accessible to all nodes within the environment. S SSD we can procure in the market today.

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

How it used to be: When I started my career in networking, servers were down the hall or in the campus data center. Why do we need to create site-to-site VPNs or some sort of modern SD-WAN topology connecting all our branches when almost all traffic goes to the public internet and the cloud? Yes, of course, I’m oversimplifying here.

Using Chakra execution traces for benchmarking and network performance optimization

Engineering at Meta

However, traditional full workload benchmarking presents several challenges: Difficulty in forecasting future system performance: When designing an AI system, engineers frequently face the challenge of predicting the performance of future systems. Our visualization tool can precisely highlight these imbalances, as shown in the figure below.

Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta runs different types of backend networks, topologies, and training jobs that have tight dependencies between software and hardware components. A small number of servers are taken out of production and maintained with all applicable upgrades. This transition has not been without its challenges.

KSQL Training for Hands-On Learning

Confluent

The production deployment lectures allow you to confidently scale a cluster, visualize a topology and demonstrate resilience in a multi-server configuration. A data engineer architect from Sydney, Australia, he lives with his wife, two kids, and a grumpy cat. The course consists of 33 lectures in total.