AWS RNG: random-graph topology, 33% more throughput, 50% fewer network devices

AWS has detailed Random Network Graph (RNG), a datacenter network topology it has been quietly deploying since late last year and that is now live in Ireland, Germany, and Spain. The headline numbers: 33% higher throughput, 50% fewer network devices, and billions in estimated savings. The structural move is replacing the fat-tree (Clos) topology that has been the hyperscale default. Fat-tree constrains traffic between servers to a limited set of paths, so congestion appears even when aggregate bandwidth is abundant. RNG increases the number of available paths by laying some fiber segments in a deliberate pattern and others randomly. That makes it a production deployment of the randomized-topology idea that academic work (Jellyfish, expander-graph networks) has been advocating for more than a decade. The research paper is at arXiv 2604.15261.
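To make the "some links deliberate, some random" idea concrete, here is a toy sketch in Python. It is not AWS's construction: the ring as the deliberate pattern, the random pairing of spare ports, and the names `ring_links`, `random_links`, and `mean_hops` are all assumptions made for illustration. It only shows one effect the random-graph literature points to, namely that adding random links to a regular pattern shortens the average path between switches.

```python
import random
from collections import deque

def ring_links(n):
    """Deliberate pattern (assumed): each switch links to the next one in a ring."""
    return {(i, (i + 1) % n) for i in range(n)}

def random_links(n, rounds, seed):
    """Random part (assumed): in each round, pair all switches uniformly at random."""
    rng = random.Random(seed)
    links = set()
    for _ in range(rounds):
        nodes = list(range(n))
        rng.shuffle(nodes)
        # Consecutive entries of the shuffled list become one link each.
        for a, b in zip(nodes[::2], nodes[1::2]):
            links.add((min(a, b), max(a, b)))
    return links

def adjacency(n, links):
    """Turn an undirected link set into a neighbor map."""
    adj = {i: set() for i in range(n)}
    for a, b in links:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def mean_hops(adj):
    """Average shortest-path length over all reachable ordered pairs (BFS per source)."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs

n = 64
ring_only = adjacency(n, ring_links(n))
mixed = adjacency(n, ring_links(n) | random_links(n, rounds=2, seed=7))
print("ring only:", round(mean_hops(ring_only), 2))
print("ring + random links:", round(mean_hops(mixed), 2))
```

Hop count is only a proxy here. The sketch says nothing about throughput, and the real design trade-offs are in the paper.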

Two engineering pieces make the randomized topology practical, and both are the interesting part for builders. ShuffleBox is a custom passive device (it consumes no electricity) that physically cross-connects the fiber cables in RNG's randomized configuration. The no-power property matters because, at datacenter scale, the cabling layer is normally either manual (error-prone) or powered (another failure domain and more power draw). Spraypoint is the custom routing protocol: routers "spray" traffic to all neighboring routers, which then forward packets toward the destination. That is how RNG exploits the many available paths without the routing-table explosion that arbitrary mesh topologies normally cause. The combination of passive hardware for the physical layer and spray routing for the logical layer is what turns a theoretically nice random graph into something operable at AWS scale.
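The spray-then-forward behavior described above can be sketched roughly as follows. This is an interpretation of the one-sentence description, not the Spraypoint protocol itself: the six-router `TOPOLOGY`, the uniform random choice of first hop, and the shortest-path forwarding after that first hop are all assumptions, and the names `hops_to` and `spray_route` are hypothetical.

```python
import random
from collections import Counter, deque

# Hypothetical six-router topology: router -> set of neighbors.
TOPOLOGY = {
    0: {1, 2, 3},
    1: {0, 5},
    2: {0, 5},
    3: {0, 4},
    4: {3, 5},
    5: {1, 2, 4},
}

def hops_to(adj, dst):
    """BFS hop count from every router to dst."""
    dist = {dst: 0}
    queue = deque([dst])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def spray_route(adj, src, dst, rng):
    """Assumed spray step: pick any neighbor at random, then forward greedily to dst."""
    path = [src]
    if src == dst:
        return path
    dist = hops_to(adj, dst)
    # Spray: the first hop ignores distance, so every outgoing link gets used.
    node = rng.choice(sorted(adj[src]))
    path.append(node)
    # Forward: every later hop moves to the neighbor closest to the destination.
    while node != dst:
        node = min(adj[node], key=lambda v: (dist[v], v))
        path.append(node)
    return path

rng = random.Random(0)
first_hops = Counter(spray_route(TOPOLOGY, 0, 5, rng)[1] for _ in range(300))
print(first_hops)  # traffic from router 0 is spread over neighbors 1, 2 and 3
```

In this sketch a router only needs a per-destination distance to forward a packet, and the spraying step is what spreads load across links, including the longer path through routers 3 and 4. Whether Spraypoint works the same way is something to check against the paper.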

The ecosystem read: randomized/expander datacenter topologies have been a research darling for years precisely because they beat fat-tree on path diversity per dollar, but they were operationally hard, with cabling complexity and routing complexity as the blockers. AWS solving both with custom hardware plus a custom protocol is the signal that the theory is now production-viable at the largest scale. For AI training specifically, the implication is straightforward even though AWS did not spell it out: collective operations like all-reduce are bandwidth-bound and congestion-sensitive, so more non-congesting paths are exactly what large-model training fabrics want. That said, the announcement gives no AI-training-specific numbers and no head-to-head against NVIDIA InfiniBand or Google's Jupiter, which is the comparison the field actually needs. The honest caveats: the "billions saved" figure is AWS's own estimate, this is AWS-internal infrastructure (not a product you can buy or open hardware you can build), and the 33% is an aggregate throughput claim with no workload breakdown.

Monday morning, if you run your own datacenter fabric: the arXiv paper (2604.15261) is worth reading for the ShuffleBox passive-crossconnect and Spraypoint spray-routing designs, because the ideas are portable even if the hardware is not. If you are an AWS customer running training or large distributed workloads in eu-west-1 (Ireland) or the German/Spanish regions: this is throughput and reliability you inherit without changing anything. The structural news is that random-graph datacenter topology has crossed from paper to hyperscale production. Watch whether the design specifics in the paper get adopted by other operators or stay an AWS moat.

