A RoCE network for distributed AI training at scale

Engineering at Meta

We ensure that there is enough ingress bandwidth on the rack switch to not hinder the training workload. The BE is a specialized fabric that connects all RDMA NICs in a non-blocking architecture, providing high bandwidth, low latency, and lossless transport between any two GPUs in the cluster, regardless of their physical location.

Building Meta’s GenAI Infrastructure

Engineering at Meta

This solution enables thousands of GPUs to save and load checkpoints in a synchronized fashion (a challenge for any storage solution) while also providing a flexible and high-throughput exabyte scale storage required for data loading. This helped push our large clusters to achieve great and expected performance just as our small clusters.

How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. Our intent was to build and learn from the operational experience.

The WAN Accelerator and Modern Network Optimization

CATO Networks

While WAN optimization and acceleration are still important, increased bandwidth availability, cloud, and mobile have significantly shifted the paradigm. What is a WAN accelerator? Simply put, a WAN accelerator is any hardware or software appliance that provides bandwidth optimization across a WAN. Here, we'll answer those questions.

Chris Maher: Employee of the Quarter!

Akins IT

I think it's old fashioned wiring to just show up at the office every day. I also had a full house going on with a family of six which made decent internet bandwidth hard to come by. For the first couple of months of quarantine, I still went in to the office every day and worked there solo.

Meeting DoorDash Growth with a Self-Service Logistics Configuration Platform 

DoorDash Engineering

Second, approvals would consume far too much bandwidth from platform team engineers. All current platform clients are T0 service, have fallbacks in place, and can perform in a degraded fashion if the platform becomes unavailable. First, the platform team doesn’t have insights about the type of changes being requested.

Egnyte Architecture: Lessons learned in building and scaling a multi petabyte content platform

High Scalability

Large files or low bandwidth. At this point, we store one DR copy in the public cloud and one copy with us, but eventually we will use our data center as a pass-through cache, since compute is cheaper in the cloud but bandwidth is expensive. We have cache filer nodes based on Tomcat/Nginx/local file system that act in LRU fashion.