OpenAI MRC fabric: 131K GPUs, no L3 routing, 8-plane spraying, lossy Ethernet, Zubnet AI News

A consortium of OpenAI, AMD, Broadcom, Intel, Microsoft and NVIDIA released MRC (Multipath Reliable Connection) through the Open Compute Project on May 5. The accompanying research paper (Araujo et al., arXiv:2605.04333) details its deployment across OpenAI's largest GB200 supercomputers, including the Stargate site with Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft's Fairwater. MRC is the networking layer behind training runs for the latest ChatGPT and Codex frontier models. Gokul Chandra Purnachandra Reddy's deep-read in Towards Data Science surfaces the load-bearing observation the press coverage missed: MRC effectively eliminates the entire Layer 3 control plane from the data center fabric. No OSPF, no BGP, no IS-IS, no FIB; switches maintain zero dynamic forwarding state. To Reddy's knowledge, this is the most aggressive elimination of dynamic routing in any production AI training fabric publicly documented to date.

The five design decisions are counterintuitive: each is individually familiar, but the combination is radical.

(1) Split the 800 Gb/s NIC into eight 100 Gb/s links, each attached to its own switch, which creates eight independent network planes. A two-tier topology then supports 131,072 GPUs at full bisection bandwidth, versus ~64K GPUs for a conventional three-tier design. The worst-case path is 3 hops versus 5-7 hops, and the design uses 2/3 the optics and 3/5 the switches of a 3-tier deployment.

(2) No dynamic routing protocols: static routes only, zero dynamic forwarding state, and a control plane simple enough that a small team can manage multiple supercomputers simultaneously.

(3) Packet spraying: each transfer is sprayed across hundreds of random paths across the eight planes. When a link fails, the NIC retires that entropy value and redistributes traffic to the remaining seven planes in microseconds.

(4) Lossy Ethernet by design: the fabric accepts packet loss intentionally rather than building backpressure cascades, and selective retransmission handles the small loss rate.

(5) ECN repurposed as a load-balancing signal rather than a congestion-control signal.

The 800 Gb/s NICs ship from three different silicon vendors.
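The scale figures in decision (1) and the failover behavior in decision (3) can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the MRC implementation: the 51.2 Tb/s switch capacity is an assumption that happens to reproduce the article's 131,072 and ~64K numbers, and `ToySprayer`, `retire_plane`, and the count of 32 entropy values per plane are hypothetical names and parameters invented for the example. The real protocol runs in NIC hardware.

```python
import random

# ASSUMPTION (not from the article): a 51.2 Tb/s switch ASIC, so the same chip
# exposes 64 ports at 800 Gb/s or 512 ports at 100 Gb/s.
SWITCH_GBPS = 51_200


def two_tier_endpoints(radix: int) -> int:
    """Non-blocking two-tier Clos: up to `radix` leaf switches, each with
    radix/2 ports facing endpoints and radix/2 ports facing spines."""
    return radix * (radix // 2)


def three_tier_endpoints(radix: int) -> int:
    """Non-blocking three-tier fat-tree built from `radix`-port switches."""
    return radix ** 3 // 4


class ToySprayer:
    """Hypothetical NIC-side sprayer; each (plane, entropy) pair is one path."""

    def __init__(self, planes: int = 8, entropies_per_plane: int = 32, seed: int = 0):
        self.rng = random.Random(seed)
        self.active = [(p, e) for p in range(planes) for e in range(entropies_per_plane)]

    def retire_plane(self, plane: int) -> None:
        """Drop every entropy value that maps onto the failed plane."""
        self.active = [path for path in self.active if path[0] != plane]

    def next_path(self) -> tuple[int, int]:
        """Pick a random surviving path for the next packet."""
        return self.rng.choice(self.active)


if __name__ == "__main__":
    # Each plane is its own two-tier network of 100 Gb/s ports; every GPU
    # has one port in each plane, so per-plane endpoints equals GPU count.
    print(two_tier_endpoints(SWITCH_GBPS // 100))    # 131072
    print(three_tier_endpoints(SWITCH_GBPS // 800))  # 65536

    nic = ToySprayer()
    nic.retire_plane(3)  # simulate losing the link into plane 3
    print(sorted({nic.next_path()[0] for _ in range(10_000)}))
```

Under the assumed radix, the two-tier count comes out to exactly 131,072 and the three-tier count to 65,536, matching the article's figures. After `retire_plane(3)`, the sprayer only ever selects paths on the seven surviving planes, which is the toy analogue of the NIC retiring entropy values for a failed link.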

The problem framing is what makes the engineering tradeoffs defensible. Synchronous pretraining at 131,072 GPUs runs in lock-step: every training step depends on the slowest transfer. The paper's quoted framing: "as computations scale, communication becomes increasingly outlier-dominated." At ~$300,000/hour cloud rates for 100K H100-class GPUs, a 10ms tail-latency stall per step, repeated across thousands of steps, compounds into real money. The production-incident anecdote is the part to weight: an optical transceiver on a T0 switch suffered a glitch and flapped all four links in rapid succession, affecting three active training nodes. In a conventional network this would have crashed the training job; with MRC, the training continued. The resilience math on link failures: in the 800 Gb/s single-plane design, one bad link costs 3% of capacity; in the 100 Gb/s multi-plane design, it costs 0.4%, and traffic continues on the remaining seven planes. The architecture buys predictable bandwidth at the cost of network monitoring complexity (8× the links to track) and a different mental model for ops teams who came up on conventional L3 fabrics.
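The cost and resilience arithmetic above can be checked with a short Python sketch. The hourly rate and the 10 ms stall come from the article; everything else is an assumption made for illustration. In particular, the step count is hypothetical, and the uplink counts (32 at 800 Gb/s, 256 at 100 Gb/s) assume a 51.2 Tb/s T0 switch with half its ports facing upward, which is one reading that reproduces the 3% and 0.4% figures rather than a confirmed detail of the deployment.

```python
def stall_cost_usd(cluster_usd_per_hour: float, stall_ms: float, steps: int) -> float:
    """Dollar value of cluster time spent idle during per-step stalls."""
    usd_per_ms = cluster_usd_per_hour / 3_600_000  # 3.6 million ms per hour
    return usd_per_ms * stall_ms * steps


def capacity_lost_pct(parallel_links: int) -> float:
    """Percent of capacity lost when one of `parallel_links` equal links fails."""
    return 100.0 / parallel_links


if __name__ == "__main__":
    # From the article: ~$300,000/hour and a 10 ms stall per step.
    # ASSUMPTION: one million steps, chosen only to make the total readable.
    print(f"${stall_cost_usd(300_000, 10, 1_000_000):,.0f}")

    # ASSUMPTION: 32 uplinks per T0 switch at 800 Gb/s, 256 at 100 Gb/s.
    print(f"{capacity_lost_pct(32):.1f}%")   # 3.1%
    print(f"{capacity_lost_pct(256):.1f}%")  # 0.4%
```

At the quoted rate, each 10 ms stall idles roughly 83 cents of cluster time, so the total scales linearly with however many steps a run actually takes.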

For builders and infra teams: this is the most concrete data point yet on what frontier-lab training-fabric architecture has become, and the OCP release means you can study the protocol design rather than reverse-engineer it from job-listing keyword analysis. Three concrete implications. First, if you are buying capacity from a frontier-lab-adjacent cloud, expect MRC-style multi-plane fabrics to be the baseline by Q3; your workload tuning assumptions about single-path RoCE need to be revisited. Second, every networking-OSS vendor who shipped OSPF/BGP optimizations specifically for AI fabrics now has a shrinking market; the consortium's fabric is the largest single deployment of dynamic-routing elimination ever documented, and where OpenAI goes, NVIDIA/Microsoft/Oracle customers follow. Third, the paper is genuinely worth reading end-to-end: Reddy's TDS deep-read is a useful guide, but the arXiv reference (2605.04333) is the canonical source. The "five counterintuitive decisions" framing is editorial; the actual surprise is that all five passed the production-stress test simultaneously in a 131K-GPU deployment, and that the OpenAI consortium chose to publish how rather than keep the engineering proprietary.
