A RoCE network for distributed AI training at scale

Engineering at Meta

We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data center (DC) scale, for example by incorporating topology-awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Network 132

How Meta trains large language models at scale

Engineering at Meta

Solving this problem requires a robust and high-speed network infrastructure as well as efficient data transfer protocols and algorithms. This has encompassed developments in a wide range of areas.

Trending Sources

Certification Internet service via iPerf3

Network Engineering

Occasionally, customers report issues such as high latency or not achieving their subscribed bandwidth. To address these concerns, we certify the last-mile connection using iPerf3 for traffic and bandwidth analysis. Attached is a topology diagram illustrating the proposed setup.
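
As a rough illustration of the certification idea in this teaser, the sketch below compares a measured throughput against a subscribed rate. It assumes the report has the layout that `iperf3 --json` produces for a TCP test (`end.sum_received.bits_per_second`); the function name `certify_link`, the 90% tolerance, and the sample numbers are hypothetical and not taken from the article.

```python
def certify_link(report: dict, subscribed_mbps: float, tolerance: float = 0.9) -> dict:
    """Compare measured throughput with the subscribed rate.

    Assumes `report` is parsed iperf3 JSON output for a TCP test.
    The 0.9 tolerance is an illustrative threshold, not a standard.
    """
    received_bps = report["end"]["sum_received"]["bits_per_second"]
    measured_mbps = received_bps / 1_000_000
    return {
        "measured_mbps": round(measured_mbps, 2),
        "subscribed_mbps": subscribed_mbps,
        "passed": measured_mbps >= subscribed_mbps * tolerance,
    }


# Hand-written sample standing in for real iperf3 output (assumed values).
sample_report = {"end": {"sum_received": {"bits_per_second": 94_500_000.0}}}
result = certify_link(sample_report, subscribed_mbps=100.0)
print(result)
```

In practice the report would come from running the iperf3 client against a server on the far side of the last-mile link and parsing its JSON output; latency checks would need a separate measurement.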

Why latency is the new outage

Kentik

Not as difficult as time travel, but difficult enough that for 30+ years IT professionals have tried to skirt the issue by adding more bandwidth between locations or by rolling out faster routers and switches. Over the last few decades, network managers have focused on adding bandwidth and reducing network outages.

TCP 116

How to Configure Static Routes on Cisco

NW Kings

Unlike dynamic routes, which are learned through dynamic routing protocols such as OSPF (Open Shortest Path First) or EIGRP (Enhanced Interior Gateway Routing Protocol), static routes require the network administrator to manually specify the destination network and its next-hop IP address. Because these routes do not change on their own, they keep the topology predictable and stable.
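
To make the idea of a manually specified route concrete, here is a minimal sketch of a static routing table with longest-prefix-match lookup, written in Python rather than Cisco configuration. The prefixes, next-hop addresses, and the `next_hop` helper are all hypothetical; this illustrates the concept only and is not how a router implements forwarding.

```python
import ipaddress

# Hypothetical static routes: destination prefix -> next-hop address.
STATIC_ROUTES = {
    "10.0.0.0/8": "192.168.1.1",
    "10.20.0.0/16": "192.168.1.2",
    "0.0.0.0/0": "192.168.1.254",  # default route
}


def next_hop(destination: str, routes: dict = STATIC_ROUTES):
    """Return the next hop of the most specific matching route, or None."""
    dest = ipaddress.ip_address(destination)
    matches = [
        (ipaddress.ip_network(prefix), hop)
        for prefix, hop in routes.items()
        if dest in ipaddress.ip_network(prefix)
    ]
    if not matches:
        return None
    # Longest prefix wins: the route with the most specific mask.
    best = max(matches, key=lambda match: match[0].prefixlen)
    return best[1]


print(next_hop("10.20.5.9"))
print(next_hop("8.8.8.8"))
```

Because the table is fixed by the administrator, the same destination always resolves to the same next hop until someone edits the entries, which is the predictability the teaser refers to.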

Routers 52

Practical Steps for Enhancing Reliability in Cloud Networks - Part I

Kentik

By collecting and analyzing network telemetry, including traffic flows, bandwidth usage, packet loss rates, and error rates, NetOps leverage monitoring to detect and diagnose potential bottlenecks, security threats, and other issues that can impact network reliability, often before end users even notice a problem.

Cloud 104

Securing Your Network Against Attacks: Prevent, Detect, and Mitigate Cyberthreats

Kentik

Cyberthreat strategies have evolved in step with modern cloud networks, often using cheap, virtualized cloud resources to exploit the threat surface topology I briefly described above. These attacks aim to overwhelm a service’s bandwidth capabilities with prohibitively high traffic volumes. Protocol-based. RPKI status checks.

Network 94