Critically Engaging With Models

Mathias Verraes

Then, we'll examine three organisational models that are specific to software development: the Spotify Model, the Agile Fluency Model, and Team Topologies. Team Topologies is a software organisational model that focuses on fast flow and value creation.

Topology 162
A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data center (DC) scale, e.g., by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Network 132
Trending Sources

Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta runs different types of backend networks, topologies, and training jobs that have tight dependencies between software and hardware components. Instead, we ensure components are compatible with each other and roll component upgrades up in a sliding fashion. This transition has not been without its challenges.

Fashion 138
How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive.

Building Meta’s GenAI Infrastructure

Engineering at Meta

This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading.

Journey to Event Driven – Part 4: Four Pillars of Event Streaming Microservices

Confluent

The control plane becomes essential when outages have occurred and a large-scale system must get back online in a coordinated fashion, perhaps with incremental, restricted functionality. In the context of a payment system, you might need to drain the current set of payments for a topology change.

SDN and Self-Driving Networks

Kentik

In a similar fashion, Facebook recently published a blog post describing their Express Backbone, which provides internal connectivity between datacenters. The system then dynamically provisions an MPLS LSP topology to meet the observed loads while optimizing for various traffic classes (e.g., latency-sensitive vs. latency-insensitive).

Network 40