Bandwidth, Engineering and Topology - IT Networking Pro Today

A RoCE network for distributed AI training at scale

Engineering at Meta

AUGUST 5, 2024

Topology: We built a dedicated backend network specifically for distributed training. To support large language models (LLMs), we expanded the backend network toward data-center scale, for example by incorporating topology awareness into the training job scheduler. We designed a two-stage Clos topology for AI racks, known as an AI Zone.

Network

Network Networking Topology Data Centers

Using Chakra execution traces for benchmarking and network performance optimization

Engineering at Meta

SEPTEMBER 7, 2023

However, traditional full workload benchmarking presents several challenges: Difficulty in forecasting future system performance: When designing an AI system, engineers frequently face the challenge of predicting the performance of future systems. Our visualization tool can precisely highlight these imbalances, as shown in the figure below.

Network

Network Networking Topology Bandwidth

Today’s Enterprise WAN Isn’t What It Used To Be

Kentik

MARCH 13, 2023

Yes, there’s something to say about how applications are written, but on the public internet side, we’ve seen a decrease in latency and cost and a massive increase in available bandwidth. So what does this mean for today’s enterprise network engineer? This coincided with the advent of public clouds such as AWS, Azure, and GCP.

WAN

WAN Wide Area Network Topology Internet

Practical Steps for Enhancing Reliability in Cloud Networks - Part I

Kentik

APRIL 4, 2023

More than anything, reliability becomes the principal challenge for network engineers working in and with the cloud. Even the most detailed reliability engineering can be easily undermined in an insecure network. While there is much to be said about cloud costs and performance, I want to focus this article primarily on reliability.

Cloud

Cloud Network Networking Bandwidth

How Meta trains large language models at scale

Engineering at Meta

JUNE 12, 2024

We optimized the RoCE cluster for quick build time, and the InfiniBand cluster for full-bisection bandwidth. We implemented collective communication patterns with network topology awareness so that they can be less latency-sensitive.

Infiniband

Infiniband Data Centers Topology Network

Building Meta’s GenAI Infrastructure

Engineering at Meta

MARCH 12, 2024

Among other benefits, Hammerspace enables engineers to perform interactive debugging for jobs using thousands of GPUs as code changes are immediately accessible to all nodes within the environment.

Infiniband

Infiniband Data Centers Server Network

Engineering dependability and fault tolerance in a distributed system

High Scalability

FEBRUARY 19, 2021

This means a system that is not merely available but is also engineered with extensive redundant measures to continue to work as its users expect. reliability situations, where continuity of service is essential, with redundant elements continuously in-service, such as with airplane engines. This ensures reliability.

Engineering

Engineering Topology Protocol Networking

Certification Internet service via iPerf3

Network Engineering

JANUARY 8, 2025

Occasionally, customers report issues such as high latency or not achieving their subscribed bandwidth. To address these concerns, we certify the last-mile connection using iPerf3 for traffic and bandwidth analysis. Attached is a topology diagram illustrating the proposed setup.

Internet

Internet Bandwidth Topology Server

How to Configure Static Routes on Cisco

NW Kings

JANUARY 7, 2025

This feature in networks predicts and stabilizes the topology. Low Overhead: Static routes do not consume bandwidth for routing updates or require additional CPU resources to compute paths. Here is how you would do it: Router1(config)# ip route 192.168.2.0

Routers

Routers IP Address Protocol Topology

SNMP vs. NetFlow

Kentik

JANUARY 29, 2020

SNMP data is also used by network engineers to troubleshoot reported problems, and by network architects for tasks such as capacity planning. Flow analytics are used to decide how traffic is sent to or received from other internet-connected peers via traffic engineering and optimization.

Port

Port Network Networking Data Centers

Network observability: Hype or reality?

Kentik

AUGUST 30, 2021

The term has a literal engineering definition that, in a nutshell, means the internal state of any system is knowable solely by external observation. Why is my bandwidth bill so high? The concept of observability has taken hold in the DevOps, SRE, and application performance monitoring (APM) space.

Network

Network Networking DevOps Cloud

SNMP vs. Flow

Kentik

JANUARY 29, 2020

SNMP data is also used by network engineers to troubleshoot reported problems, and by network architects for tasks such as capacity planning. Flow analytics are used to decide how traffic is sent to or received from other internet-connected peers via traffic engineering and optimization.

Port

Port Network Networking Data Centers

Securing Your Network Against Attacks: Prevent, Detect, and Mitigate Cyberthreats

Kentik

MARCH 15, 2023

Cyberthreat strategies have evolved in step with modern cloud networks, often using cheap, virtualized cloud resources to exploit the threat surface topology I briefly described above. These attacks aim to overwhelm a service’s bandwidth capabilities with prohibitively high traffic volumes. Protocol-based.

Network

Network Networking Protocol IP Address

Built-In Multi-Region Replication with Confluent Platform 5.4-preview

Confluent

SEPTEMBER 16, 2019

However, in order to operate a reliable stretch cluster, datacenters must be relatively close to each other and have a very stable, low latency, and high-bandwidth connection among the DCs. datacenter topology. David Arthur is a software engineer on the Core Kafka Team at Confluent. This is sometimes referred to as a 2.5

Bandwidth

Bandwidth WAN Topology Networking

SD-WAN and Cloud Security

CATO Networks

MAY 6, 2018

Traditionally, enterprises configure their WAN in a classic hub-and-spoke topology, where users in sites access resources in headquarters or a datacenter. Bandwidth-intensive traffic, bound for the Internet and cloud, is backhauled across the MPLS WAN.

WAN

WAN Cloud Wide Area Network MPLS

IT Managers: Read This Before Leaving Your MPLS Provider

CATO Networks

APRIL 20, 2022

Maybe you’re an IT manager or a network engineer. It’s no secret that MPLS circuits cost a fortune, often 3-4x the price of MPLS alternatives (like SD-WAN), for only a fraction of the bandwidth. It’s about a year before your MPLS contract expires, and you’ve been told to cut costs by your CFO.

MPLS

MPLS WAN SASE Network

Top Ten Technology Trends for 2024

Vedcraft

JUNE 16, 2024

While tools and technologies play an enabling role, engineering practices and an organizational culture that supports developers’ well-being play the pivotal role. The Team Topologies approach to organizing software engineering teams has emerged as a great reference for building an effective platform engineering team.

Cloud

Cloud Engineering Application Data Centers

When Reliability Goes Wrong in Cloud Networks

Kentik

MAY 31, 2023

In this article, I want to underscore why NetOps has an integral role (and more responsibility) in delivering on the promise of reliability and highlight a few examples of how engineering for reliability can make networks less reliable. This is no small feat and can lead to significant overhead and resource consumption.

Network

Network Networking Cloud Routers

Building and deploying MySQL Raft at Meta

Engineering at Meta

MAY 16, 2023

Over the last few years, we have implemented MySQL Raft, a Raft consensus engine that was integrated with MySQL to build a replicated state machine. MySQL Raft replication topologies: A Raft ring would consist of several MySQL instances (four in the diagram) in different regions. During apply, a new replica-only binlog is created.

Engineering

Engineering Protocol Server Topology
