Tilde Research released Aurora, a new "leverage-aware" optimizer that fixes a hidden bug in Muon, the optimizer that has been quietly powering modded-nanoGPT speedruns and a growing set of frontier training pipelines. The bug: in tall matrices like MLP layers, Muon's polar-factor update creates row-norm anisotropy, which causes some neurons to receive massive updates while others get virtually nothing. By the 500th training step, more than one in four neurons are effectively dead. Aurora keeps that from happening at roughly 6% extra compute, drops in as a Muon replacement, and ships with open code on GitHub.
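
To make the row-norm anisotropy concrete, here is a minimal NumPy sketch. It assumes the exact polar factor computed via SVD (Muon itself approximates this step iteratively) and random Gaussian matrices standing in for gradients, with one row per neuron. The helper names `polar_factor` and `row_norms` are illustrative, not taken from the Aurora or Muon code.

```python
import numpy as np


def polar_factor(g: np.ndarray) -> np.ndarray:
    """Return U @ Vt from the thin SVD of g (the orthogonal polar factor)."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def row_norms(x: np.ndarray) -> np.ndarray:
    """Euclidean norm of each row; one row per neuron in this sketch."""
    return np.linalg.norm(x, axis=1)


rng = np.random.default_rng(0)

# Square case: the polar factor is orthogonal, so every row has norm 1.
square = polar_factor(rng.standard_normal((64, 64)))
square_norms = row_norms(square)

# Tall case (256 rows, 64 columns): the columns stay orthonormal, but the
# squared row norms only have to sum to 64, so individual rows can differ.
tall = polar_factor(rng.standard_normal((256, 64)))
tall_norms = row_norms(tall)

print("square: min/max row norm", square_norms.min(), square_norms.max())
print("tall:   min/max row norm", tall_norms.min(), tall_norms.max())
print("tall:   sum of squared row norms", np.sum(tall_norms**2))
```

Random Gaussian inputs only show a moderate spread; the point of the sketch is that nothing in the polar factor of a tall matrix forces the rows to be equal, whereas the square case pins every row at norm 1.
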
The diagnosis matters more than the fix. NorMuon, an earlier intermediate fix, corrected for row-norm anisotropy via post-hoc normalization to unit norm and got good empirical results, but it didn't explain why the underlying problem existed. Aurora's analysis: Muon's polar-factor update does the right thing for square matrices and the wrong thing for tall ones. "Tall" describes most MLPs with large expansion factors, so the bug compounds in exactly the architectures everyone is training. Aurora reformulates the weight update as a joint constraint: left semi-orthogonality AND uniform row norms, solved simultaneously rather than patched after the fact. Two implementations ship: Riemannian Aurora (gradient projection on the constrained manifold) and vanilla Aurora (a simpler, practical variant). Tilde reports 100× data efficiency on open-source internet data at the 1.1B scale, a new state of the art on the modded-nanoGPT speedrun (surpassing NorMuon's prior SOTA), and results that outperform larger models on HellaSwag. The 100× claim needs evaluation-harness disclosure before being treated as gospel (that would be a generational result, not an incremental one), but the speedrun SOTA is the more verifiable point because it has public reference numbers everyone can compare against.
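
The difference between patching after the fact and satisfying both constraints together can be illustrated numerically. The sketch below is not Aurora's algorithm and not NorMuon's either. It measures the two constraint residuals for (a) a plain polar factor, (b) that polar factor with rows rescaled afterwards, and (c) a naive alternating projection, which is a hypothetical stand-in chosen only because it is easy to write down. Rows are rescaled to norm sqrt(n/m) rather than 1, since that is the uniform value compatible with orthonormal columns for an m x n matrix. All function names are illustrative.

```python
import numpy as np


def polar_factor(g):
    """Nearest matrix with orthonormal columns (U @ Vt from the thin SVD)."""
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return u @ vt


def normalize_rows(x):
    """Rescale every row to norm sqrt(n/m) for an m x n matrix."""
    m, n = x.shape
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x * (np.sqrt(n / m) / norms)


def orthogonality_error(x):
    """Frobenius distance of X^T X from the identity (0 if semi-orthogonal)."""
    return np.linalg.norm(x.T @ x - np.eye(x.shape[1]))


def row_norm_spread(x):
    """Max row norm divided by min row norm (1.0 if perfectly uniform)."""
    norms = np.linalg.norm(x, axis=1)
    return norms.max() / norms.min()


def naive_joint_projection(g, iters=50):
    """Hypothetical illustration only: alternate the two projections, ending
    on the polar step. Not Aurora's method; convergence is not guaranteed."""
    x = polar_factor(g)
    for _ in range(iters):
        x = polar_factor(normalize_rows(x))
    return x


rng = np.random.default_rng(0)
g = rng.standard_normal((128, 32))  # tall: 128 neurons, 32 inputs each

polar_only = polar_factor(g)          # orthonormal columns, uneven rows
patched = normalize_rows(polar_only)  # even rows, columns no longer orthonormal
joint = naive_joint_projection(g)     # aims at both constraints together

for name, x in [("polar", polar_only), ("patched", patched), ("joint", joint)]:
    print(f"{name:8s} ortho_err={orthogonality_error(x):.2e} "
          f"row_spread={row_norm_spread(x):.4f}")
```

The patched variant trades one residual for the other: its rows are uniform, but its columns are no longer orthonormal. That trade-off is what a joint formulation is meant to avoid.
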
Muon has been gaining adoption since late 2024 as a more compute-efficient alternative to AdamW, especially for nanoGPT-style speedruns and increasingly for production frontier training runs. Aurora's diagnosis means everyone getting good results from Muon has been quietly losing about a quarter of their MLP capacity to dead neurons by step 500, and presumably more later in training. NorMuon was already a sign that people sensed something was off without having the explanation. The broader pattern: optimizer research had a quiet decade in which AdamW was treated as solved, and the recent wave (Lion, Sophia, Muon, NorMuon, now Aurora) is reopening the question. The drop-in replacement framing and the 6% compute overhead are what make Aurora adoptable rather than a research curiosity: if it ports cleanly to existing training pipelines, the bar for switching from Muon is low. The dead-neurons number is also a useful diagnostic to add to anyone's training-run dashboard, regardless of which optimizer they end up choosing.
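
A dashboard metric along these lines could be as small as the sketch below. The definition of "dead" used here (a row whose update norm falls below a fraction of the median row norm) and the `rel_threshold` default are assumptions made for illustration; the write-up above does not say how Tilde measures it. It can be applied to a per-step update matrix or to the accumulated weight change of a layer, with one row per neuron.

```python
import numpy as np


def dead_row_fraction(update: np.ndarray, rel_threshold: float = 0.1) -> float:
    """Fraction of rows whose norm is below rel_threshold * median row norm.

    `rel_threshold` is a hypothetical cutoff chosen for illustration, not a
    value taken from the Aurora release.
    """
    norms = np.linalg.norm(update, axis=1)
    cutoff = rel_threshold * np.median(norms)
    return float(np.mean(norms < cutoff))


# Toy update for 8 neurons with 4 inputs each; two rows are nearly zero.
update = np.ones((8, 4))
update[:2] *= 1e-3

print(dead_row_fraction(update))  # 0.25
```

Logging this per layer every few hundred steps is cheap, and using the median as the reference keeps the metric insensitive to the overall scale of the update.
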
Code is on GitHub at `tilde-research/aurora-release`, and the paper is at blog.tilderesearch.com. If you're training transformers above the 100M-parameter scale and using Muon, Aurora is worth a controlled A/B run on your specific workload before believing the 100× headline number. The neuron-death framing is the part that should concern anyone running production training on Muon: you may have been losing capacity you didn't know you were losing. For everyone else, the optimizer research wave continues to suggest that "training stability" and "training efficiency" still have substantial unsolved problems behind them, and that the labs that solve them gain outsized leverage on the rest of the stack.
