How Uber Built Odin to Handle 3.8 Million Containers

ByteByteGo

The key advantages of Grail are as follows: Operates across tens of thousands of hosts in multiple data centers and cloud regions. Unlike traditional database monitoring tools that operate per data center, Grail aggregates data across all Uber locations. Works with all storage technologies managed by Odin.

A RoCE network for distributed AI training at scale

Engineering at Meta

Distributed training, in particular, imposes the most significant strain on data center networking infrastructure. Constructing a reliable, high-performance network infrastructure capable of accommodating this burgeoning demand necessitates a reevaluation of data center network design.

Seamless network integration: connecting OpenShift to your data center with Apstra

Juniper

In today’s fast-paced digital world, businesses demand agility and efficiency from their IT infrastructure. Furthermore, EDA actively monitors the Kubernetes API for SR-IOV-bound pod creation and deletion.

Massive Scale Visibility Challenges Inside Hyperscale Data Centers

Kentik

Hyperscale data centers are true marvels of the age of analytics, enabling a new era of cloud-scale computing that leverages Big Data, machine learning, cognitive computing and artificial intelligence. The compute capacity of these data centers is staggering.

How Meta trains large language models at scale

Engineering at Meta

Data center deployment: Once we’ve chosen a GPU and system, the task of placing them in a data center for optimal usage of resources (power, cooling, networking, etc.) ... We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive.

Announcing Complete Azure Observability for Kentik Cloud

Kentik

Kentik customers move workloads to (and from) multiple clouds, integrate existing hybrid applications with new cloud services, migrate to Virtual WAN to secure private network traffic, and make on-premises data and applications redundant to multiple clouds – or cloud data and applications redundant to the data center.
