A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

Massive Scale Visibility Challenges Inside Hyperscale Data Centers

Kentik

Hyperscale data centers are true marvels of the age of analytics, enabling a new era of cloud-scale computing that leverages Big Data, machine learning, cognitive computing and artificial intelligence. The compute capacity of these data centers is staggering.

How Meta trains large language models at scale

Engineering at Meta

Data center deployment: Once we’ve chosen a GPU and system, the task of placing them in a data center for optimal usage of resources (power, cooling, networking, etc.) ... We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive.

Announcing Complete Azure Observability for Kentik Cloud

Kentik

Kentik customers move workloads to (and from) multiple clouds, integrate existing hybrid applications with new cloud services, migrate to Virtual WAN to secure private network traffic, and make on-premises data and applications redundant to multiple clouds – or cloud data and applications redundant to the data center.

Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

How it used to be: When I started my career in networking, servers were down the hall or in the campus data center. Most resources were local, accessed remotely over some sort of leased line, or at worst, over a site-to-site back to the organization’s private data center. Yes, of course, I’m oversimplifying here.

Live Training: Build Your Own Networking Lab

Rule 11

The instructors will build a variety of network topologies, including data center and campus, to help learners understand how to test in different environments. The course begins with obtaining and starting the basic tools required to build and test network labs using open-source and freely available tools. Register here.