Collectives Playground
The four communication primitives, animated hop by hop on a unidirectional ring (the book's animations run both directions, halving the hop count; the costs below use the bidirectional formulas).
Collective
AllGather [A_X, B] → [A, B]
Removes a sharding subscript: every device passes its shard around the ring until all devices hold every shard.
1 / 8
Each device starts with its own shard.
0
→1
→2
→3
→4
→5
→6
→7
→Each column is one device; each cell is one chunk of the array, coloured by identity. The ring wraps from the last device back to device 0.
Runtime comparison — bf16[2048, 8192] (33.6 MB) on a ring of 8 TPU v5e chips
| Collective | Runtime (bidirectional ring) | Time for this example |
|---|---|---|
| AllGather | bytes / W_ici | 373 µs |
| ReduceScatter | bytes / W_ici | 373 µs |
| AllReduce | 2 × bytes / W_ici | 746 µs |
| AllToAll | bytes / (4 × W_ici) | 93.2 µs |
Note the striking result from the chapter: in the bandwidth-bound regime these times do not depend on the number of devices — only on the array size and the per-link bandwidth (W_ici = 9e10 bytes/s bidirectional).