A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data-center scale, e.g., by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta runs different types of backend networks, topologies, and training jobs that have tight dependencies between software and hardware components. We ensure components are compatible with each other and roll out component upgrades in a sliding fashion. This transition has not been without its challenges.

Trending Sources

Building Meta’s GenAI Infrastructure

Engineering at Meta

It played and continues to play an important role in the development of Llama and Llama 2, as well as advanced AI models for applications ranging from computer vision, NLP, and speech recognition to image generation and even coding. Under the hood, our newer AI clusters build upon the successes and lessons learned from RSC.

Journey to Event Driven – Part 4: Four Pillars of Event Streaming Microservices

Confluent

Storing events in a stream and connecting streams via stream processors provide a generic, data-centric, distributed application runtime that you can use to build ETL, event streaming applications, applications for recording metrics and anything else that has a real-time data requirement. Let’s explore what this really means.

Journey to Event Driven – Part 2: Programming Models for the Event-Driven Architecture

Confluent

Although the principles behind event-driven frameworks are sound, those behind event sourcing, CQRS, and hydrating application state are separate concerns, so we often see them handled explicitly as an orthogonal concern (e.g., operational processes) or externally (think GitHub for your application's state).

SDN and Self-Driving Networks

Kentik

In a similar fashion, Facebook recently published a blog post describing its Express Backbone, which provides internal connectivity between datacenters. The system then dynamically provisions an MPLS LSP topology to meet the observed loads while optimizing for various traffic classes (e.g., latency-sensitive vs. latency-insensitive).
