Collectives Playground

The four communication primitives, animated hop by hop on a unidirectional ring (the book's animations run both directions, halving the hop count; the costs below use the bidirectional formulas).

Collective
AllGather [A_X, B] → [A, B]

Removes a sharding subscript: every device passes its shard around the ring until all devices hold every shard.

1 / 8

Each device starts with its own shard.

0
1
2
3
4
5
6
7

Each column is one device; each cell is one chunk of the array, coloured by identity. The ring wraps from the last device back to device 0.

Runtime comparison — bf16[2048, 8192] (33.6 MB) on a ring of 8 TPU v5e chips

CollectiveRuntime (bidirectional ring)Time for this example
AllGatherbytes / W_ici373 µs
ReduceScatterbytes / W_ici373 µs
AllReduce2 × bytes / W_ici746 µs
AllToAllbytes / (4 × W_ici)93.2 µs

Note the striking result from the chapter: in the bandwidth-bound regime these times do not depend on the number of devices — only on the array size and the per-link bandwidth (W_ici = 9e10 bytes/s bidirectional).