
How Meta trains large language models at scale

Engineering at Meta

Optimal connectivity between GPUs: Large-scale model training involves transferring vast amounts of data between GPUs in a synchronized fashion. Solving this problem requires a robust, high-speed network infrastructure as well as efficient data-transfer protocols and algorithms. Both candidate fabrics, RoCE over Ethernet and InfiniBand, had tradeoffs.
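The synchronized GPU-to-GPU transfers described above are typically implemented with collective operations such as ring all-reduce. The sketch below is a minimal plain-Python simulation (ranks as lists, no real GPUs or network), assuming the gradient length is divisible by the number of ranks; it shows the two phases, reduce-scatter followed by all-gather, in which each rank exchanges one chunk per step with its ring neighbor:

```python
def ring_allreduce(grads):
    """Simulate ring all-reduce: grads is a list of per-rank gradient
    lists of equal length; returns per-rank lists that all hold the
    element-wise sum. Assumes len(grads[0]) is divisible by len(grads)."""
    n = len(grads)                       # number of ranks in the ring
    chunks = [list(g) for g in grads]    # work on copies
    k = len(grads[0]) // n               # elements per chunk

    def sl(c):
        return slice(c * k, (c + 1) * k)

    # Reduce-scatter: after n-1 steps, rank r holds the full sum
    # of chunk (r+1) % n.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, chunks[r][sl((r - step) % n)])
                 for r in range(n)]      # snapshot before applying
        for r, c, data in sends:
            dst = (r + 1) % n
            chunks[dst][sl(c)] = [a + b
                                  for a, b in zip(chunks[dst][sl(c)], data)]

    # All-gather: circulate the fully reduced chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, chunks[r][sl((r + 1 - step) % n)])
                 for r in range(n)]
        for r, c, data in sends:
            chunks[(r + 1) % n][sl(c)] = data

    return chunks

print(ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))
# → [[12, 15, 18], [12, 15, 18], [12, 15, 18]]
```

Each rank sends and receives only 2*(n-1) chunks regardless of ring size, which is why ring all-reduce is bandwidth-optimal and why the underlying fabric, rather than the algorithm, becomes the bottleneck at scale.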


A RoCE network for distributed AI training at scale

Engineering at Meta

This backend fabric utilizes the RoCEv2 protocol, which encapsulates the RDMA service in UDP packets for transport over the network. Initially, our GPU clusters used a simple star topology with a few AI racks connected to a central Ethernet switch running the non-routable RoCEv1 protocol.
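The RoCEv2 encapsulation can be illustrated by packing the headers it adds: the RDMA payload is prefixed with a 12-byte InfiniBand Base Transport Header (BTH) and carried in a UDP datagram addressed to the IANA-assigned destination port 4791 (the outer Ethernet and IP headers are omitted here). This is a layout sketch, not a packet generator; the opcode and field values below are illustrative:

```python
import struct

ROCEV2_UDP_DPORT = 4791  # IANA-assigned UDP destination port for RoCEv2

def rocev2_headers(udp_src_port, payload_len, dest_qp, psn):
    """Build the UDP header plus 12-byte InfiniBand BTH that RoCEv2
    places in front of an RDMA payload (Ethernet/IP layers omitted)."""
    # UDP header: source port, destination port, UDP length, checksum
    # (checksum 0 = unused in this sketch)
    udp = struct.pack("!HHHH", udp_src_port, ROCEV2_UDP_DPORT,
                      8 + 12 + payload_len, 0)
    # BTH: opcode, SE/M/PadCnt/TVer byte, partition key,
    #      reserved + 24-bit destination queue pair,
    #      ack-request/reserved + 24-bit packet sequence number
    opcode = 0x04  # illustrative: InfiniBand RC SEND-only
    bth = struct.pack("!BBHII", opcode, 0, 0xFFFF,
                      dest_qp & 0xFFFFFF, psn & 0xFFFFFF)
    return udp + bth

pkt = rocev2_headers(udp_src_port=49152, payload_len=1024, dest_qp=0x2A, psn=1)
assert len(pkt) == 20  # 8-byte UDP header + 12-byte BTH
```

Because the RDMA semantics travel inside an ordinary UDP datagram, RoCEv2 packets are routable across IP networks, unlike RoCEv1, which rides directly on Ethernet frames and is confined to a single Layer 2 domain.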
