Topics: Data Centers, Networking, Topology

A RoCE network for distributed AI training at scale

Engineering at Meta

AI networks play an important role in interconnecting tens of thousands of GPUs, forming the foundational infrastructure for training and enabling large models with hundreds of billions of parameters, such as Llama 3.1. Distributed training, in particular, imposes the most significant strain on data center networking infrastructure.
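
The synchronized GPU-to-GPU traffic described above comes largely from collective operations such as all-reduce. As a rough sketch (not Meta's code), the snippet below shows how a PyTorch training process might average gradients across ranks using the NCCL backend, which can run over a RoCE fabric when RDMA-capable NICs are present; the model, tensor sizes, and launcher-provided environment variables (RANK, WORLD_SIZE, LOCAL_RANK) are assumptions for illustration.

```python
# Illustrative sketch only: the kind of synchronized gradient exchange that
# stresses an AI training fabric. Assumes a torchrun-style launcher that sets
# RANK, WORLD_SIZE, LOCAL_RANK, MASTER_ADDR, and MASTER_PORT.
import os

import torch
import torch.distributed as dist


def average_gradients(model: torch.nn.Module) -> None:
    """Sum every gradient across all ranks, then divide by the world size."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size


if __name__ == "__main__":
    dist.init_process_group(backend="nccl")  # NCCL can use RDMA/RoCE NICs
    torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", "0")))

    model = torch.nn.Linear(1024, 1024).cuda()
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()
    average_gradients(model)  # every rank now holds identical averaged gradients
```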

Seamless network integration: connecting OpenShift to your data center with Apstra

Juniper

In today’s fast-paced digital world, businesses demand agility and efficiency from their IT infrastructure. The most commonly deployed templates set up a cloud-scale EVPN-VXLAN fabric.
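
As a hedged illustration of the kind of intent-based automation the post describes, the snippet below sketches creating a fabric blueprint from a template through Apstra's REST API with plain HTTP calls. The controller address, endpoint paths, payload fields, and template ID are assumptions for illustration and will differ by Apstra version; treat the official API reference as authoritative.

```python
# Rough sketch only; paths, field names, and IDs are illustrative assumptions,
# not a verified Apstra API reference.
import requests

APSTRA = "https://apstra.example.com"  # hypothetical controller address

# Authenticate and obtain an API token (endpoint assumed).
login = requests.post(
    f"{APSTRA}/api/aaa/login",
    json={"username": "admin", "password": "secret"},
    verify=False,
)
headers = {"AuthToken": login.json()["token"]}

# Instantiate a blueprint from an EVPN-VXLAN fabric template (payload assumed).
blueprint = requests.post(
    f"{APSTRA}/api/blueprints",
    headers=headers,
    json={
        "design": "two_stage_l3clos",
        "init_type": "template_reference",
        "template_id": "evpn-vxlan-dc-template",  # hypothetical template ID
        "label": "openshift-dc-fabric",
    },
    verify=False,
)
print(blueprint.json().get("id"))
```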

Kentik Bridges the Intelligence Gap for Hybrid Cloud Networks

Kentik

As Kentik’s product manager for hybrid cloud, I am always talking to infrastructure and network teams around the world to understand a day in their lives. This gives me an invaluable understanding of the challenges, goals, and priorities they face today, as well as a view into their future network monitoring needs.

Announcing Complete Azure Observability for Kentik Cloud

Kentik

Kentik customers move workloads to (and from) multiple clouds, integrate existing hybrid applications with new cloud services, migrate to Virtual WAN to secure private network traffic, and make on-premises data and applications redundant to multiple clouds – or cloud data and applications redundant to the data center.

How Meta trains large language models at scale

Engineering at Meta

Supporting GenAI at scale has meant rethinking how our software, hardware, and network infrastructure come together. Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion, which requires revisiting trade-offs made for other types of workloads.
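
One of those trade-offs is how much of the synchronized communication can be hidden behind computation. The sketch below is an illustration under assumptions (not Meta's implementation): non-blocking all-reduces are launched per gradient bucket and awaited only when the results are needed, so network transfers overlap with other work. It assumes torch.distributed has already been initialized with a backend such as NCCL.

```python
# Illustrative sketch: overlap collective communication with computation by
# issuing asynchronous all-reduces and waiting on them later.
from typing import List

import torch
import torch.distributed as dist


def start_bucket_allreduce(buckets: List[torch.Tensor]) -> list:
    """Launch a non-blocking all-reduce for each gradient bucket."""
    return [dist.all_reduce(b, op=dist.ReduceOp.SUM, async_op=True) for b in buckets]


def finish_bucket_allreduce(buckets: List[torch.Tensor], handles: list) -> None:
    """Wait for the transfers to complete, then average each bucket in place."""
    world_size = dist.get_world_size()
    for handle in handles:
        handle.wait()
    for bucket in buckets:
        bucket /= world_size
```

PyTorch's DistributedDataParallel applies the same idea automatically by bucketing gradients and reducing them while the backward pass is still running.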

Building Meta’s GenAI Infrastructure

Engineering at Meta

We are sharing details on the hardware, network, storage, design, performance, and software that help us extract high throughput and reliability for various AI workloads, including two versions of our 24,576-GPU data-center-scale cluster at Meta. We use this cluster design for Llama 3 training.