A RoCE network for distributed AI training at scale

Engineering at Meta

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data-center scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Network 132

Why is my SaaS application so slow?

Kentik

Some users simply can’t do their jobs when an application becomes unavailable. That’s why keeping a proverbial finger on the pulse of application performance is generally worth the effort. Many popular SaaS applications are delivered from hundreds of locations around the world. But it isn’t easy. Start with the desktop.

Independent Compliance and Security Assessment – Two Additions to the All-New Cato Management Application

CATO Networks

The new Cato Management Application that we announced today certainly brings a scalable, powerful interface. We enhanced security reporting with an all-new threats dashboard and opened up application performance with another new dashboard. Behind the Cato Management Application is a completely rearchitected backend.

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

Yes, there’s something to say about how applications are written, but on the public internet side, we’ve seen a decrease in latency and cost and a massive increase in available bandwidth. It requires new tools, skills, and an understanding of how application traffic travels over the internet. I know there are always exceptions.

WAN 98

Practical Steps for Enhancing Reliability in Cloud Networks - Part I

Kentik

By collecting and analyzing network telemetry, including traffic flows, bandwidth usage, packet loss rates, and error rates, NetOps teams use monitoring to detect and diagnose potential bottlenecks, security threats, and other issues that can impact network reliability, often before end users even notice a problem.

Cloud 104

Building Meta’s GenAI Infrastructure

Engineering at Meta

It played and continues to play an important role in the development of Llama and Llama 2, as well as advanced AI models for applications ranging from computer vision, NLP, and speech recognition, to image generation, and even coding. Under the hood: Our newer AI clusters build upon the successes and lessons learned from RSC.

Using Chakra execution traces for benchmarking and network performance optimization

Engineering at Meta

Such predictions become even more complex when the compute engines aren’t ready or when changes in network topology and bandwidth become necessary. As a result, traces sourced from one system might not accurately simulate on another with a different GPU, network topology, and bandwidth.

Network 109