Remove Data Centers Remove Engineering Remove Topology
article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

Network 132
article thumbnail

How Meta trains large language models at scale

Engineering at Meta

Data center deployment Once we’ve chosen a GPU and system, the task of placing them in a data center for optimal usage of resources (power, cooling, networking, etc.) We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive.

Insiders

Sign Up for our Newsletter

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Trending Sources

article thumbnail

Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.

article thumbnail

Announcing Complete Azure Observability for Kentik Cloud

Kentik

Kentik customers move workloads to (and from) multiple clouds, integrate existing hybrid applications with new cloud services, migrate to Virtual WAN to secure private network traffic, and make on-premises data and applications redundant to multiple clouds – or cloud data and applications redundant to the data center.

Cloud 105
article thumbnail

Massive Scale Visibility Challenges Inside Hyperscale Data Centers

Kentik

Hyperscale data centers are true marvels of the age of analytics, enabling a new era of cloud-scale computing that leverages Big Data, machine learning, cognitive computing and artificial intelligence. the compute capacity of these data centers is staggering.

article thumbnail

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

How it used to be When I started my career in networking, servers were down the hall or in the campus data center. Most resources were local, accessed remotely over some sort of leased line, or at worst, over a site-to-site back to the organization’s private data center. Yes, of course, I’m oversimplifying here.

WAN 98
article thumbnail

Maintaining large-scale AI capacity at Meta

Engineering at Meta

Meta is currently operating many data centers with GPU training clusters across the world. Our data centers are the backbone of our operations, meticulously designed to support the scaling demands of compute and storage. The post Maintaining large-scale AI capacity at Meta appeared first on Engineering at Meta.

Fashion 138