Remove Infiniband Remove Port Remove Server
article thumbnail

A RoCE network for distributed AI training at scale

Engineering at Meta

The CTSW has deep buffers statically divided over the ports in the chassis. The scheduler does this by learning the position of GPU servers in the logical topology to recommend a rank assignment. The spine tier, composed of modular cluster training switches (CTSW), provides scale-out connectivity among all racks in the cluster.

Network 132