A RoCE network for distributed AI training at scale
Engineering at Meta
AUGUST 5, 2024
The spine tier, composed of modular cluster training switches (CTSW), provides scale-out connectivity among all racks in the cluster. Each CTSW has deep buffers that are statically divided among the ports in the chassis.

The training job scheduler learns the position of GPU servers in the logical topology and uses that information to recommend a rank assignment.
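To make the rank-assignment idea concrete, here is a minimal sketch, not the scheduler described in this post. It assumes each server can be described by a hypothetical `Server` record with `zone` and `rack` labels and a GPU count; all of these names and fields are invented for illustration. It simply orders servers by their position so that consecutive ranks land on topologically close GPUs, then counts how many neighboring rank pairs still cross a rack boundary.

```python
from typing import NamedTuple


class Server(NamedTuple):
    """Hypothetical record of one GPU server's place in the logical topology."""
    name: str
    zone: str      # assumed label for the spine-level group the server sits in
    rack: str      # rack within that group
    gpus: int = 8  # assumed GPUs per server, for illustration only


def recommend_ranks(servers):
    """Return {(server name, local GPU index): rank}.

    Servers are ordered by (zone, rack, name), so GPUs that share a rack
    receive consecutive ranks.
    """
    ordered = sorted(servers, key=lambda s: (s.zone, s.rack, s.name))
    ranks = {}
    next_rank = 0
    for server in ordered:
        for local_gpu in range(server.gpus):
            ranks[(server.name, local_gpu)] = next_rank
            next_rank += 1
    return ranks


def cross_rack_neighbor_pairs(servers, ranks):
    """Count consecutive-rank pairs whose GPUs sit in different racks."""
    rack_of = {s.name: (s.zone, s.rack) for s in servers}
    by_rank = sorted(ranks, key=ranks.get)
    return sum(
        rack_of[a[0]] != rack_of[b[0]]
        for a, b in zip(by_rank, by_rank[1:])
    )
```

Comparing `cross_rack_neighbor_pairs` for this ordering against an arbitrary one gives a rough feel for how much neighbor-to-neighbor traffic (as in a ring collective) would have to leave the rack. A real scheduler would weigh far more than this single count.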