Infiniband and Server - IT Networking Pro Today

How Meta trains large language models at scale

Engineering at Meta

JUNE 12, 2024

There are several reasons for this failure, but this failure mode is seen more in the early life and settles as the server ages. HW network cable: In the general category of unreachable servers, these failures are also seen most often in the early life of the server. Both of these options had tradeoffs.

Infiniband

Infiniband Data Centers Topology Networking

Building Meta’s GenAI Infrastructure

Engineering at Meta

MARCH 12, 2024

The other cluster features an NVIDIA Quantum2 InfiniBand fabric. Through careful co-design of the network, software, and model architectures, we have successfully used both RoCE and InfiniBand clusters for large, GenAI workloads (including our ongoing training of Llama 3 on our RoCE cluster) without any network bottlenecks.

Infiniband

Infiniband Data Centers Server Networking

Top Tips for Debugging and Optimizing NVIDIA Networking Performance

Router-switch

SEPTEMBER 25, 2024

In today’s high-speed networking world, optimizing and troubleshooting performance is crucial, especially with high-performance equipment like NVIDIA InfiniBand switches. In this blog, we’ll share top tips for debugging and optimizing NVIDIA InfiniBand networking performance.

Infiniband

Infiniband Networking Network Routers

A RoCE network for distributed AI training at scale

Engineering at Meta

AUGUST 5, 2024

The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. Routing: The scaling of compute power and network topology discussed above led to the question of how to efficiently balance and route the massive training traffic. Thus, we took two steps to improve the performance.

Networking

Networking Network Topology Data Centers
