A RoCE network for distributed AI training at scale

Engineering at Meta

AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training large models with hundreds of billions of parameters, such as LLAMA 3.1. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.

Watch Meta’s engineers discuss optimizing large-scale networks

Engineering at Meta

Managing network solutions amidst a growing scale inherently brings challenges around performance, deployment, and operational complexities. Meta's engineers present key ideas underpinning the FBOSS model that helped them build a stable and scalable network.

Trending Sources

Hybrid vs. Multi-cloud: The Good, the Bad and the Network Observability Needed

Kentik

Outlined in light blue is the hybrid cloud, which includes the on-premises network as well as the virtual private cloud (VPC) in the AWS public cloud. This allows DevOps teams to configure the application to increase or decrease the amount of system capacity, such as CPU, storage, memory, and input/output bandwidth, all on demand.

OCP Summit 2024: The open future of networking hardware for AI

Engineering at Meta

At Open Compute Project Summit (OCP) 2024, we’re sharing details about our next-generation network fabric for our AI training clusters. We’ve expanded our network hardware portfolio and are contributing two new disaggregated network fabrics and a new NIC to OCP.

HN748: How AI and HPC Are Changing Data Center Networks

Packet Pushers

On today's episode of Heavy Networking, Rob Sherwood joins us to discuss the impact that High Performance Computing (HPC) and artificial intelligence computing are having on data center network design. There's also power and cooling issues, massive bandwidth requirements, and changes in how we. That's the boring part.

Network Speed vs. Bandwidth vs. Throughput: Understanding Network Performance Metrics

Obkio

Learn about the differences between network speed, bandwidth & throughput. Find out why your business should measure them and how!

Best Practices for Enriching Network Telemetry to Support Network Observability

Kentik

Network observability is critical. You need the ability to answer any question about your network (across clouds, on-prem, edge locations, and user devices) quickly and easily. But network observability is not always easy. And even then, key questions, such as "Am I using my network resources effectively?"
