A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data-center scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.
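As a rough illustration of what topology awareness in a job scheduler can mean on a two-stage Clos fabric, here is a minimal Python sketch. It is not Meta's scheduler: the names `switch_hops` and `place_job`, the leaf/spine terminology, and the greedy packing policy are all assumptions made for this example.

```python
def switch_hops(rack_a: int, rack_b: int) -> int:
    """Switches traversed between two GPUs in a two-stage Clos (assumed model).

    Same rack: traffic turns around at the shared leaf switch (1 switch).
    Different racks: leaf -> spine -> leaf (3 switches).
    """
    return 1 if rack_a == rack_b else 3


def place_job(free_gpus: dict[int, int], gpus_needed: int) -> dict[int, int]:
    """Hypothetical greedy placement: fill the racks with the most free GPUs
    first, so the job spans as few racks (and spine crossings) as possible.

    free_gpus maps rack id -> number of free GPUs in that rack.
    Returns rack id -> number of GPUs assigned from that rack.
    """
    if gpus_needed > sum(free_gpus.values()):
        raise ValueError("not enough free GPUs for this job")

    placement: dict[int, int] = {}
    remaining = gpus_needed
    # Largest racks first; ties broken by rack id for deterministic output.
    for rack, free in sorted(free_gpus.items(), key=lambda kv: (-kv[1], kv[0])):
        if remaining == 0:
            break
        take = min(free, remaining)
        if take:
            placement[rack] = take
            remaining -= take
    return placement
```

For instance, `place_job({0: 4, 1: 8, 2: 2}, 10)` assigns eight GPUs from rack 1 and two from rack 0, leaving rack 2 untouched, so only one pair of racks exchanges traffic through the spine layer.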

How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms. Both of these options had tradeoffs.

Trending Sources

Journey to Event Driven – Part 2: Programming Models for the Event-Driven Architecture

Confluent

Streams represent the core data model, and stream processors are the connecting nodes that enable flow creation resulting in a streaming data topology. The streaming topology shows a flow of data through an organization, representing the real-time DNA of your business. In some cases, it is desirable to hide these protocol concerns.
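The idea of streams as the data model and stream processors as connecting nodes can be sketched in a few lines of Python. This is a toy model using generators, not the Kafka Streams API or the article's own code; `only_payments`, `add_fee`, `build_topology`, and the event fields are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, Iterator

Event = dict
Processor = Callable[[Iterable[Event]], Iterator[Event]]


def only_payments(events: Iterable[Event]) -> Iterator[Event]:
    """Processor node: pass through payment events only."""
    for event in events:
        if event["type"] == "payment":
            yield event


def add_fee(events: Iterable[Event]) -> Iterator[Event]:
    """Processor node: enrich each event with a 1% fee, in integer cents."""
    for event in events:
        yield {**event, "fee_cents": event["amount_cents"] // 100}


def build_topology(*processors: Processor) -> Callable[[Iterable[Event]], Iterator[Event]]:
    """Chain processors so each one's output stream feeds the next one."""
    def run(source: Iterable[Event]) -> Iterator[Event]:
        stream = iter(source)
        for processor in processors:
            stream = processor(stream)
        return stream
    return run
```

Calling `build_topology(only_payments, add_fee)` yields a function that takes a source stream and returns the transformed stream lazily; each processor only sees the output of the node before it, which is the sense in which the processors connect streams into a topology.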

Journey to Event Driven – Part 4: Four Pillars of Event Streaming Microservices

Confluent

The control plane becomes essential when outages have occurred and a large-scale system must get back online in a coordinated fashion, perhaps with incremental, restricted functionality. In the context of a payment system, you might need to drain the current set of payments for a topology change.