Google DeepMind published Decoupled DiLoCo on Wednesday, an extension of its DiLoCo low-communication distributed training work that partitions training runs across asynchronous islands of compute. The paper reports training a 12-billion-parameter Gemma 4 model across four US regions using mixed TPU generations (v6e and v5p), hitting 64.1% average benchmark accuracy against a 64.4% tightly synchronized baseline. The communication reduction is dramatic: from 198 Gbps to 0.84 Gbps across 8 datacenters, a 235x drop. The resilience claim is stronger still. In a simulated 1.2-million-chip environment with high failure rates, Decoupled DiLoCo maintained 88% goodput while conventional synchronous training collapsed to 27%.

The architecture builds on DiLoCo's two-level structure: inner local optimization steps on each worker, outer synchronization of parameter deltas at intervals. Decoupled DiLoCo replaces the synchronous outer loop with an asynchronous one. Independent learners compute local updates and push parameter fragments to a central synchronizer, which aggregates them using a minimum quorum rule, an adaptive grace window for stragglers, and dynamic token-weighted merging so faster learners contribute proportionally more to each update cycle. The word decoupled is load-bearing. Failed or slow workers do not block the global step; they time out of the grace window and get reincorporated when they recover. That is why the goodput curve holds up under failures that cripple traditional synchronous training.

The significance for production ML teams is twofold. First, the bandwidth reduction changes which training topologies are economically viable. Training across geographically distributed datacenters has been gated by the inter-region bandwidth cost of gradient synchronization. A 235x bandwidth reduction puts multi-region training within reach of any cloud tenant with standard interconnects. Second, the failure tolerance matters at the scales Google, Meta, and other hyperscalers now operate. Training at 100K-plus chips means hardware failures are routine rather than exceptional. Synchronous training turns each failure into a restart; Decoupled DiLoCo treats failures as stragglers and keeps running with the learners that remain. At the simulated 1.2M-chip scale, the difference between 88% and 27% goodput represents billions of dollars of compute efficiency over a multi-month run.

For builders working below hyperscaler scale, the research is still useful. The quorum-plus-grace-window pattern generalizes beyond training: if you are building any distributed system that aggregates contributions from unreliable workers, adaptive grace windows, minimum quorums, and weighted merging form a known-good design. The open-source DiLoCo lineage continues through Prime Intellect's OpenDiLoCo framework, which decentralized community training efforts have been extending since 2024. Expect Decoupled DiLoCo's specific innovations to land in those open implementations within weeks. The takeaway for model developers outside Google is that the assumptions baked into most distributed training recipes (tight synchronization, single-datacenter deployment, uniform hardware) are now explicitly challenged by a working 12B-parameter demonstration at research scale. Production frameworks will catch up, and teams that understand why will be better positioned to exploit that flexibility when they do.
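As a closing illustration of how the pattern transfers outside training, here is a minimal sketch of a grace-window membership rule for one aggregation round. It is a hypothetical helper under simplifying assumptions: the grace period is fixed rather than adaptive, arrival times are supplied pre-sorted rather than observed live, and all names are my own.

```python
def collect_with_grace(arrivals, min_quorum, grace):
    """Decide which worker contributions make the cut for one round.

    arrivals: list of (worker_id, arrival_time) pairs, sorted by time.
    min_quorum: keep the round open until at least this many have arrived.
    grace: after quorum is met, accept stragglers for this much longer.

    Returns the worker ids included in the round, or an empty list if
    quorum was never reached (callers would retry or extend the window).
    """
    if len(arrivals) < min_quorum:
        return []
    # The window closes a fixed grace period after the quorum-th arrival;
    # anything later is treated as a straggler for the next round.
    deadline = arrivals[min_quorum - 1][1] + grace
    return [worker for worker, t in arrivals if t <= deadline]
```

A worker arriving well after the deadline (a crashed-and-recovered node, say) is simply excluded from this round and reincorporated in the next one, which is the same non-blocking behavior that keeps the paper's goodput curve flat under failures.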