HN748: How AI and HPC Are Changing Data Center Networks

Packet Pushers

On today's episode of Heavy Networking, Rob Sherwood joins us to discuss the impact that High Performance Computing (HPC) and artificial intelligence computing are having on data center network design. There are also power and cooling issues, massive bandwidth requirements, and changes in how we. That's the boring part.

A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

Network 132

Trending Sources

Watch Meta’s engineers discuss optimizing large-scale networks

Engineering at Meta

At Meta, we’ve found that these challenges broadly fall into three themes: 1.) Data center networking: Over the past decade, on the physical front, we have seen a rise in vendor-specific hardware that comes with heterogeneous feature and architecture sets (e.g., non-blocking architecture).

OCP Summit 2024: The open future of networking hardware for AI

Engineering at Meta

In today’s world, where more and more data center infrastructure is being devoted to supporting new and emerging AI technologies, open hardware takes on an important role in assisting with disaggregation. Those ideas have made Meta’s data centers among the most sustainable and efficient in the world.

Network 117

Meta’s open AI hardware vision

Engineering at Meta

Networking and bandwidth play an important role in ensuring the clusters’ performance. Our systems consist of a tightly integrated HPC compute system and an isolated high-bandwidth compute network that connects all our GPUs and domain-specific accelerators. Building AI clusters requires more than just GPUs.

Bandwidth 131

Building Meta’s GenAI Infrastructure

Engineering at Meta

Today, we’re sharing details on two versions of our 24,576-GPU data center scale cluster at Meta. Custom designing much of our own hardware, software, and network fabrics allows us to optimize the end-to-end experience for our AI researchers while ensuring our data centers operate efficiently.

SiTime product launch boosts efficiency of AI data centres

DCNN Magazine

The company states that this is the only single-chip timing product that delivers the most resilient performance for AI compute nodes with high bandwidth and network synchronisation. SiTime is the only semiconductor company fully dedicated to developing innovative timing solutions required for the complex scaling of today's AI data centres.

Energy 109