Remove Infiniband Remove Server Remove UDP port
article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. ECMP and path pinning We initially considered the widely adopted ECMP, which places flows randomly based on the hashes on the five-tuple: source and destination IPs, source and destination UDP ports, and protocol.

Network 132