
A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

Network 132

How Meta trains large language models at scale

Engineering at Meta

Data center deployment: Once we’ve chosen a GPU and system, the next task is placing them in a data center for optimal usage of resources (power, cooling, networking, etc.). We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth.

Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.


Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

How it used to be: When I started my career in networking, servers were down the hall or in the campus data center. Most resources were local, accessed remotely over some sort of leased line, or at worst, over a site-to-site link back to the organization’s private data center. Yes, of course, I’m oversimplifying here.

WAN 98

Why is Cisco ACI replacing traditional networks?

The Network DNA

Cisco Application Centric Infrastructure (ACI) is a next-generation, policy-driven SDN solution designed for spine-leaf data center architectures. Cisco ACI provides application agility and data center automation with simplified operations. Adding spine switches increases fabric bandwidth.
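
The claim that adding spine switches increases fabric bandwidth comes down to simple arithmetic: in a spine-leaf fabric, each leaf has an uplink to every spine, so each added spine adds uplink capacity to every leaf. The sketch below is a rough illustration and not Cisco tooling. The function names, port counts, and link speeds are hypothetical values chosen for the example.

```python
def leaf_uplink_gbps(num_spines: int, uplink_gbps: float, links_per_spine: int = 1) -> float:
    """Total spine-facing capacity of one leaf, assuming links to every spine."""
    return num_spines * links_per_spine * uplink_gbps


def oversubscription(host_ports: int, host_gbps: float,
                     num_spines: int, uplink_gbps: float) -> float:
    """Ratio of host-facing capacity to spine-facing capacity on one leaf."""
    return (host_ports * host_gbps) / leaf_uplink_gbps(num_spines, uplink_gbps)


if __name__ == "__main__":
    # Hypothetical leaf: 48 host ports at 25 Gbps, one 100 Gbps uplink per spine.
    for spines in (2, 4, 6):
        uplink = leaf_uplink_gbps(spines, 100)
        ratio = oversubscription(48, 25, spines, 100)
        print(f"{spines} spines: {uplink} Gbps uplink per leaf, {ratio}:1 oversubscription")
```

With these assumed numbers, going from two to six spines triples the uplink capacity of each leaf and lowers the oversubscription ratio from 6:1 to 2:1.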

Network 52

Practical Steps for Enhancing Reliability in Cloud Networks - Part I

Kentik

However, arriving at specs for other aspects of network performance requires extensive monitoring, dashboarding, and data engineering to unify this data and make it meaningful. Additionally, monitoring becomes critical for network optimization by identifying areas where resources are under- or overutilized.

Cloud 104

The business case for SD-WAN: Because MPLS is Not Fit for the Cloud

CATO Networks

That means making sure the wide area network (WAN) that connects branch offices, data centers, cloud services, and SaaS applications can handle the connectivity needs of digitally empowered global organizations. Particularly with SaaS, many business-critical applications are no longer hosted in on-site data centers.

MPLS 52