A RoCE network for distributed AI training at scale

Engineering at Meta

We ensure that there is enough ingress bandwidth on the rack switch so that it does not hinder the training workload. The backend (BE) network is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high-bandwidth, low-latency, lossless transport between any two GPUs in the cluster, regardless of their physical location.
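As a rough illustration of how training code rides on such a fabric, here is a minimal sketch of pointing NCCL collectives at a RoCE/RDMA backend; the environment-variable values and the torchrun launch are assumptions for the example, not Meta's configuration:

```python
# Minimal sketch (not Meta's setup): steer NCCL's collective traffic onto
# the RDMA NICs of a RoCE fabric. The values below are illustrative.
import os

os.environ.setdefault("NCCL_IB_HCA", "mlx5")     # assumed RDMA NIC name prefix
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")  # GID index commonly used for RoCEv2

import torch
import torch.distributed as dist

def main():
    # Launched with torchrun, which sets MASTER_ADDR/LOCAL_RANK etc.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
    # A gradient all-reduce is the synchronized, any-GPU-to-any-GPU traffic
    # pattern the lossless fabric is built to carry.
    grad = torch.ones(1024, device="cuda")
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```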

Building Meta’s GenAI Infrastructure

Engineering at Meta

This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing the flexible, high-throughput, exabyte-scale storage required for data loading. It is paired with the highest-capacity SSDs procurable on the market today and optimized across the full system (software, network, etc.).
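To make the synchronized-checkpoint pattern concrete, here is a minimal sketch (not Meta's implementation) in which every rank writes its own shard, bracketed by barriers so all ranks enter and leave the checkpoint phase together; the path layout is a hypothetical example and an initialized process group is assumed:

```python
# Minimal sketch: per-rank checkpoint shards saved/loaded in lockstep.
import torch
import torch.distributed as dist

def save_checkpoint(model, step, root="/checkpoints"):  # hypothetical path
    rank = dist.get_rank()
    dist.barrier()                        # all ranks pause training together
    shard = {"step": step, "model": model.state_dict()}
    torch.save(shard, f"{root}/step{step}-rank{rank}.pt")
    dist.barrier()                        # nobody resumes until every shard lands

def load_checkpoint(model, step, root="/checkpoints"):
    rank = dist.get_rank()
    shard = torch.load(f"{root}/step{step}-rank{rank}.pt", map_location="cpu")
    model.load_state_dict(shard["model"])
    dist.barrier()                        # resume in lockstep
    return shard["step"]
```

With thousands of ranks, all shards hit storage in the same narrow window, which is exactly why the excerpt calls synchronized checkpointing a challenge for any storage solution.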

Trending Sources

How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. Hardware failures can occur for several reasons, but this failure mode is seen more often early in a server's life and settles as the server ages.
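Because early-life faults are expected at this scale, training loops typically roll back to the last good checkpoint rather than losing the run. A minimal sketch of that recovery pattern follows; train_step, save_checkpoint, and load_checkpoint are hypothetical helpers, and nothing here is Meta's actual code:

```python
# Minimal sketch: checkpoint-and-restart around transient hardware faults.
def train_with_recovery(model, batches, total_steps,
                        checkpoint_every=100, max_retries=3):
    step = load_checkpoint(model)          # resume from last good state
    failures = 0
    while step < total_steps:
        try:
            train_step(model, batches[step])
        except RuntimeError:               # CUDA/NCCL faults surface here
            failures += 1
            if failures > max_retries:
                raise                      # persistent fault; escalate
            step = load_checkpoint(model)  # roll back and retry
            continue
        failures = 0
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(model, step)
```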

Egnyte Architecture: Lessons learned in building and scaling a multi petabyte content platform

High Scalability

Application servers, an Apache FTP server, and native desktop and server-hosted clients allow both interactive and hybrid sync access to the entire namespace, covering tens of petabytes of data stored on Egnyte's servers and in object stores such as GCS, S3, and Azure Blob Storage, along with permissions and large-file or low-bandwidth scenarios.
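One way a platform can span several object stores is to put a single interface in front of them so the rest of the system addresses blobs by key regardless of backend. The sketch below is illustrative only and assumes nothing about Egnyte's real code; the ObjectStore protocol and S3Store class are hypothetical names:

```python
# Minimal sketch: one interface over multiple object stores.
from typing import Protocol

class ObjectStore(Protocol):
    def get(self, key: str) -> bytes: ...
    def put(self, key: str, data: bytes) -> None: ...

class S3Store:
    """S3 backend; GCS and Azure Blob backends would mirror this shape."""
    def __init__(self, bucket: str):
        import boto3                      # real AWS SDK
        self._bucket = boto3.resource("s3").Bucket(bucket)

    def get(self, key: str) -> bytes:
        return self._bucket.Object(key).get()["Body"].read()

    def put(self, key: str, data: bytes) -> None:
        self._bucket.Object(key).put(Body=data)

def copy(src: ObjectStore, dst: ObjectStore, key: str) -> None:
    # Moving a blob between backends never touches backend-specific APIs.
    dst.put(key, src.get(key))
```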