Maintaining large-scale AI capacity at Meta

Engineering at Meta

Instead, we ensure components are compatible with each other and roll out component upgrades in a sliding fashion. Maintenance trains: Meta maintains capacity by using maintenance trains, which involve shutting down small amounts of capacity in a cyclic fashion. This approach also allows us to guarantee capacity availability.
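The cyclic shutdown described in this excerpt can be illustrated with a minimal sketch. It is not Meta's implementation: the function name `maintenance_train`, the host names, and the batch size are hypothetical, and the only idea taken from the excerpt is that a small, fixed amount of capacity is taken out at a time and the cycle then repeats.

```python
from itertools import islice
from typing import Iterator, List


def maintenance_train(hosts: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive batches of hosts to take down, wrapping around forever.

    Hypothetical helper: at most `batch_size` hosts are out at any time,
    so len(hosts) - batch_size hosts always remain available.
    """
    if not 0 < batch_size <= len(hosts):
        raise ValueError("batch_size must be between 1 and len(hosts)")
    start = 0
    while True:
        # Take the next batch_size hosts, wrapping at the end of the list.
        yield [hosts[(start + i) % len(hosts)] for i in range(batch_size)]
        start = (start + batch_size) % len(hosts)


# Example: six hypothetical hosts, two in maintenance per wave.
hosts = [f"host{i}" for i in range(6)]
waves = list(islice(maintenance_train(hosts, 2), 4))
for wave in waves:
    print("in maintenance:", wave)
```

Because the batch size is fixed, the capacity left in service during every wave is known in advance, which is one simple way to read the "guarantee capacity availability" claim.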

How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. There are several reasons for this failure, but this failure mode is seen more in the early life and settles as the server ages.

Building Meta’s GenAI Infrastructure

Engineering at Meta

This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading. S SSD we can procure in the market today.

A RoCE network for distributed AI training at scale

Engineering at Meta

The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. The second approach involved posting each message to a different queue, in a round-robin fashion. But it also produced smaller message sizes on fabric as well as multiple ACKs.
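The round-robin posting mentioned in this excerpt can be sketched in a few lines. This is only an illustration of the distribution pattern: the queues here are plain Python lists rather than real network queue pairs, and the name `post_round_robin` is hypothetical, not taken from the article.

```python
from itertools import cycle
from typing import Dict, List


def post_round_robin(messages: List[str], num_queues: int) -> Dict[int, List[str]]:
    """Assign each message to the next queue in turn, wrapping around."""
    if num_queues < 1:
        raise ValueError("num_queues must be at least 1")
    queues: Dict[int, List[str]] = {q: [] for q in range(num_queues)}
    queue_ids = cycle(range(num_queues))
    for msg in messages:
        # Each message goes to a different queue than the previous one.
        queues[next(queue_ids)].append(msg)
    return queues


# Example: five hypothetical messages spread over two queues.
assignment = post_round_robin(["m0", "m1", "m2", "m3", "m4"], 2)
print(assignment)
```

Spreading consecutive messages over separate queues distributes the load, but each queue then carries its own, smaller stream of messages, which is consistent with the trade-off the excerpt notes.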

NetOps for Application Developers: Understanding the Importance of Network Operations in Modern Development

Kentik

Instead of an IT or infra team trying to manage the hydra of networks and configurations in a completely different fashion, cross-functional teams help ensure each service or development vertical has a NetOps representative from planning through deployment. Automated workflows: Automation is critical to both DevOps and NetOps.

Cloud Services are Eating the World

CATO Networks

Early indicators were abundant: Salesforce.com has displaced Siebel Systems, reducing the need for costly, customized implementations, and Amazon AWS is increasingly displacing physical servers, reducing the need for processors, cabinets, cabling, power, and cooling. In my view, this observation is now obsolete.

Using Streams Replication Manager Prefixless Replication for Kafka Topic Aggregation

Cloudera Blog

It contains the name (alias), address (bootstrap servers), and credentials that SRM can use to access a specific cluster. The setup in this tutorial is minimal and insecure, so you only need to configure the Name, Bootstrap Servers, and Security Protocol lines. Click “Add Kafka Credentials.” Configure the credential.