A RoCE network for distributed AI training at scale
Engineering at Meta
AUGUST 5, 2024
The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment.

ECMP and path pinning

We initially considered the widely adopted ECMP (equal-cost multipath), which places flows randomly based on a hash of the five-tuple: source and destination IPs, source and destination UDP ports, and protocol.
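As a rough illustration of hash-based ECMP placement, the sketch below hashes a flow's five-tuple and reduces it modulo the number of equal-cost uplinks. The names `FiveTuple` and `ecmp_pick_uplink` are hypothetical. CRC32 is used only as a stand-in, since real switches use vendor-specific hash functions and seeds.

```python
import zlib
from typing import NamedTuple


class FiveTuple(NamedTuple):
    """The fields ECMP hashes on (hypothetical helper type)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number; 17 is UDP


def ecmp_pick_uplink(flow: FiveTuple, num_uplinks: int) -> int:
    """Return the index of the uplink this flow would be placed on.

    Assumption: CRC32 stands in for the switch's real hash function.
    """
    key = "|".join(str(field) for field in flow).encode()
    return zlib.crc32(key) % num_uplinks


if __name__ == "__main__":
    # RoCEv2 traffic is carried over UDP with destination port 4791.
    flow = FiveTuple("10.0.0.1", "10.0.1.1", 49152, 4791, 17)
    print(ecmp_pick_uplink(flow, num_uplinks=8))
```

Because the result depends only on the five-tuple, every packet of a flow follows the same path. Two large flows can also hash to the same uplink while other uplinks stay idle, because the hash takes no account of link load.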