A RoCE network for distributed AI training at scale

Engineering at Meta

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data-center (DC) scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

How Meta trains large language models at scale

Engineering at Meta

Solving this problem requires a robust, high-speed network infrastructure as well as efficient data transfer protocols and algorithms. This has encompassed developments in a wide range of areas.

Trending Sources

Reinforcing Networks: Advancing Resiliency and Redundancy Techniques

Kentik

Routing protocols and their impact on network resilience: the roles of IGP and BGP. Let’s first dive into how routing protocols, particularly Interior Gateway Protocols (IGP) and Border Gateway Protocol (BGP), can influence network resilience and help reduce the need for complete redundancy.

Arcadia: An end-to-end AI system performance simulator

Engineering at Meta

Within these pillars, AI cluster performance can be influenced by multiple factors, including model parameters, workload distribution, job scheduler logic, topology, and hardware specs. We’re also investigating a framework to provide design insights for different topology/routing designs given a set of known models.

Certification Internet service via iPerf3

Network Engineering

Attached is a topology diagram illustrating the proposed setup. Are GRE tunnels the best solution in this scenario, or would another tunneling or routing protocol be more effective? Request for Feedback: Has anyone implemented a similar setup or overcome a similar challenge?

How to Configure Static Routes on Cisco

NW Kings

Unlike dynamic routes, which are learned through dynamic routing protocols such as OSPF (Open Shortest Path First) or EIGRP (Enhanced Interior Gateway Routing Protocol), static routes require the network administrator to specify the destination and the next hop manually. Because static routes do not change on their own, they keep the topology predictable and stable.

Resilience and Redundancy in Networking

Kentik

This includes the ability to dynamically adjust to changes in network topology, detect and respond to outages, and route around faults in order to maintain connectivity and service levels. While redundancy is a significant contributor to network resilience, other mechanisms, protocols, and methods can also contribute to overall network resilience.