
Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta runs different types of backend networks, topologies, and training jobs with tight dependencies between software and hardware components. Rather than upgrading everything at once, we ensure components are compatible with each other and roll out component upgrades in a sliding fashion. This transition has not been without its challenges.
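A sliding rollout like the one described can be sketched as an ordering problem: upgrade one component at a time, and only when the new version interoperates with everything currently deployed. This is a minimal illustrative sketch; the component names, the `COMPATIBLE` table, and the function names are assumptions, not Meta's actual tooling.

```python
# Pairs of (component, version) known to interoperate. Purely illustrative data.
COMPATIBLE = {
    ("firmware", 2): {("driver", 1), ("driver", 2)},
    ("driver", 2): {("firmware", 1), ("firmware", 2)},
}

def can_upgrade(component, new_version, deployed):
    """True if new_version interoperates with every other deployed component."""
    allowed = COMPATIBLE.get((component, new_version), set())
    return all(
        (other, ver) in allowed
        for other, ver in deployed.items()
        if other != component
    )

def plan_sliding_rollout(deployed, targets):
    """Order upgrades so each step keeps the fleet mutually compatible."""
    plan = []
    pending = dict(targets)
    while pending:
        progressed = False
        for component, version in list(pending.items()):
            if can_upgrade(component, version, deployed):
                deployed[component] = version
                plan.append((component, version))
                del pending[component]
                progressed = True
        if not progressed:
            raise RuntimeError(f"no compatible upgrade order for {pending}")
    return plan
```

The loop greedily applies any upgrade that is safe right now, so the fleet never passes through an incompatible intermediate state.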


How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. There are several reasons for such failures, but this failure mode is seen more in a server's early life and settles as the server ages.


Trending Sources


A RoCE network for distributed AI training at scale

Engineering at Meta

Topology We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network towards the DC-scale, e.g., incorporating topology-awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.
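Topology-awareness in a training scheduler can be illustrated with a simple placement rule: prefer fitting a whole job inside one AI Zone so collective traffic stays within the zone's two-stage Clos rather than crossing zones. This is a hedged sketch; the zone names and the `pick_zone` function are illustrative assumptions, not Meta's scheduler.

```python
def pick_zone(free_gpus_by_zone, gpus_needed):
    """Return the smallest AI Zone that can host the whole job, or None.

    Best-fit keeps larger zones free for bigger jobs; returning None signals
    that the job would have to span zones (more network hops).
    """
    candidates = [
        (free, zone)
        for zone, free in free_gpus_by_zone.items()
        if free >= gpus_needed
    ]
    if not candidates:
        return None
    return min(candidates)[1]
```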


Building Meta’s GenAI Infrastructure

Engineering at Meta

This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading.
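One common way to let thousands of GPUs checkpoint in parallel is sharding: each rank saves only its slice of the model state, and the full state is reassembled on load. This is a minimal sketch of the idea under that assumption; the function names are illustrative and not Meta's checkpointing API.

```python
def shard_state(state, num_ranks):
    """Split a flat parameter dict into per-rank shards (round-robin by key).

    Each rank writes only its own shard, so writers hit storage in parallel
    instead of funneling one giant file through a single writer.
    """
    shards = [{} for _ in range(num_ranks)]
    for i, (name, tensor) in enumerate(sorted(state.items())):
        shards[i % num_ranks][name] = tensor
    return shards

def load_from_shards(shards):
    """Reassemble the full state dict from all saved shards."""
    state = {}
    for shard in shards:
        state.update(shard)
    return state
```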


Journey to Event Driven – Part 4: Four Pillars of Event Streaming Microservices

Confluent

The control plane becomes essential when outages have occurred and a large-scale system must get back online in a coordinated fashion, perhaps with incremental, restricted functionality. In the context of a payment system, you might need to drain the current set of payments for a topology change.
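Draining in-flight payments before a topology change boils down to two phases: stop admitting new work, then wait for everything in flight to settle. The small class below is a hedged sketch of that pattern; the `PaymentDrain` name and its methods are illustrative assumptions, not part of Confluent's example code.

```python
class PaymentDrain:
    """Coordinated drain: reject new payments, let in-flight ones complete."""

    def __init__(self):
        self.accepting = True
        self.in_flight = set()

    def submit(self, payment_id):
        if not self.accepting:
            raise RuntimeError("draining: new payments rejected")
        self.in_flight.add(payment_id)

    def complete(self, payment_id):
        self.in_flight.discard(payment_id)

    def begin_drain(self):
        """Phase 1: stop admitting new payments."""
        self.accepting = False

    def drained(self):
        """Phase 2 done: every in-flight payment has settled."""
        return not self.accepting and not self.in_flight
```

Once `drained()` returns True, the topology change can be applied safely and the system reopened.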


SDN and Self-Driving Networks

Kentik

In a similar fashion, Facebook recently published a blog describing its Express Backbone, which provides internal connectivity between data centers. The system then dynamically provisions an MPLS LSP topology to meet the observed loads while optimizing for various traffic classes (e.g., latency-sensitive vs. insensitive).
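Optimizing per traffic class typically means different path-selection objectives: latency-sensitive traffic wants the shortest path, while bulk traffic can take a longer but less loaded one. This is a toy sketch in that spirit; the path tuples and class names are assumptions, not Express Backbone's actual controller logic.

```python
def choose_path(paths, traffic_class):
    """Pick a path index from a list of (latency_ms, utilization) tuples.

    Latency-sensitive traffic minimizes latency (ties broken on load);
    bulk traffic minimizes load (tolerating longer paths).
    """
    if traffic_class == "latency-sensitive":
        return min(range(len(paths)), key=lambda i: (paths[i][0], paths[i][1]))
    return min(range(len(paths)), key=lambda i: (paths[i][1], paths[i][0]))
```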


Journey to Event Driven – Part 2: Programming Models for the Event-Driven Architecture

Confluent

Streams represent the core data model, and stream processors are the connecting nodes; together they form a streaming data topology. That topology shows the flow of data through an organization, representing the real-time DNA of your business. Unlike the previous model, events are front and center.
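The processors-as-nodes, streams-as-edges idea can be sketched with a tiny graph where each node transforms an event and forwards it downstream. The `Processor` class below is an illustrative toy, not the Kafka Streams API; the three-node flow and its field names are assumptions.

```python
class Processor:
    """A node in a streaming topology; streams are the edges between nodes."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        self.downstream = []

    def to(self, other):
        """Connect this node's output stream to another processor."""
        self.downstream.append(other)
        return other

    def process(self, event):
        """Transform the event and push it through all downstream nodes."""
        out = self.fn(event)
        results = [out] if not self.downstream else []
        for node in self.downstream:
            results.extend(node.process(out))
        return results

# A three-node flow: parse -> enrich -> total (illustrative names).
parse = Processor("parse", lambda e: dict(e))
enrich = Processor("enrich", lambda e: {**e, "region": "emea"})
total = Processor("total", lambda e: e["amount"] * 2)
parse.to(enrich).to(total)
```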