NVIDIA shipped a single 30B-parameter checkpoint this week that can be sliced, without retraining, into 23B or 12B variants. The smaller models don't degrade like distilled fallbacks; they're co-trained inside the larger one and extracted via importance ranking at deploy time. For anyone running reasoning workloads at scale, the assumption that "smaller cheaper model" and "larger smarter model" are different files just got loosened.
The base is Nemotron Nano v3, NVIDIA's Mamba-Transformer-MoE hybrid (3.6B active of 30B total). Star Elastic's mechanism is width-based elastic training: components (embedding channels, attention heads, Mamba SSM heads, MoE expert count, FFN intermediate dimensions) get ranked by contribution and packed so the top-ranked contiguous subset stays when you slice down. A learnable Gumbel-Softmax router trains jointly with the model, selecting what activates per parameter budget. NVIDIA tested depth compression (dropping layers) and found it recovered 95.2% of baseline; width compression hit 98.1%, so they prioritized width. Elastic-23B scores 85.63 on AIME-2025 versus Qwen3-30B-A3B's 80.00. Training cost: 360× fewer tokens than pretraining three separate variants, 7× fewer than sequential distillation-based compression. Nested quantization preserves slicing across FP8 and NVFP4, so the same checkpoint covers precision selection too.
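To make "ranked by contribution and packed so the top-ranked contiguous subset stays" concrete, here is a minimal sketch on one toy feed-forward block. It is not NVIDIA's code: the importance score (mean post-ReLU activation on a small batch), the function names, and the dimensions are all assumptions chosen for illustration. It shows only the packing idea: permute hidden channels by importance once, and every smaller width becomes a prefix of the same weights.

```python
import random

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def ffn(x, w_in, w_out):
    """Two-layer ReLU feed-forward block: relu(x @ w_in) @ w_out."""
    hidden = [[max(h, 0.0) for h in row] for row in matmul(x, w_in)]
    return matmul(hidden, w_out)

def channel_importance(x, w_in):
    """Hypothetical score: mean post-ReLU activation of each hidden channel."""
    hidden = matmul(x, w_in)
    return [sum(max(row[j], 0.0) for row in hidden) / len(hidden)
            for j in range(len(w_in[0]))]

def pack_by_importance(w_in, w_out, importance):
    """Reorder hidden channels so the most important ones come first."""
    order = sorted(range(len(importance)), key=lambda j: -importance[j])
    packed_in = [[row[j] for j in order] for row in w_in]  # permute columns of w_in
    packed_out = [w_out[j] for j in order]                 # permute rows of w_out
    return packed_in, packed_out, order

def slice_width(w_in, w_out, width):
    """Keep the first `width` hidden channels, i.e. a contiguous prefix."""
    return [row[:width] for row in w_in], w_out[:width]

# Toy dimensions and random weights (assumptions, not the real model's shapes).
rng = random.Random(0)
D_MODEL, D_FF = 4, 8
w_in = [[rng.gauss(0, 1) for _ in range(D_FF)] for _ in range(D_MODEL)]
w_out = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(D_FF)]
x = [[rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(3)]

importance = channel_importance(x, w_in)
packed_in, packed_out, order = pack_by_importance(w_in, w_out, importance)

# Two "budgets" cut from the same packed weights; the small one nests in the mid one.
small_in, small_out = slice_width(packed_in, packed_out, 2)
mid_in, mid_out = slice_width(packed_in, packed_out, 4)
```

Because the permutation only reorders hidden channels, the full-width block computes the same function before and after packing (up to floating-point summation order); only the sliced variants differ. The real system ranks several axes at once and learns the per-budget selection with a router, which this sketch does not attempt.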
The Llama-family deployment pattern has been "ship a 7B, 13B, 70B family: separately pretrained, separately distilled, separately hosted." MatFormer (Google Research and collaborators) and Megatron Elastic explored nested approaches with fixed importance ranking and single-axis pruning. What's new here is joint training with a learnable router across multiple pruning axes simultaneously (SSM dimension, embedding channels, expert count, FFN width), plus REAP (Router-Weighted Expert Activation Pruning) ranking MoE experts by gate × output magnitude rather than just routing frequency. For wrapper economies and agent stacks that currently route between fast-cheap and slow-deep model calls on separate endpoints, the architectural assumption that those are different models loosens. One checkpoint, latency dial.
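A small sketch of why "gate × output magnitude" can rank experts differently from routing frequency. The aggregation used here (averaging gate weight times the L2 norm of the expert's output over the tokens routed to it) is an assumption about how such a score might be computed, and the data structures, function names, and toy numbers are hypothetical.

```python
import math

def reap_style_scores(gates, outputs, num_experts):
    """Average of gate_weight * ||expert_output|| over tokens routed to each expert.

    gates[t]   maps expert id -> gate weight for token t (routed experts only).
    outputs[t] maps expert id -> that expert's output vector for token t.
    """
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for token_gates, token_outputs in zip(gates, outputs):
        for expert, gate in token_gates.items():
            norm = math.sqrt(sum(v * v for v in token_outputs[expert]))
            totals[expert] += gate * norm
            counts[expert] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]

def frequency_scores(gates, num_experts):
    """Baseline: how many tokens were routed to each expert."""
    counts = [0] * num_experts
    for token_gates in gates:
        for expert in token_gates:
            counts[expert] += 1
    return counts

def ranking(scores):
    """Expert ids sorted from highest score to lowest."""
    return sorted(range(len(scores)), key=lambda e: -scores[e])

# Toy data: expert 0 is routed to most often but contributes tiny outputs;
# expert 1 is routed to once but with a large output.
gates = [{0: 0.9, 2: 0.1}, {0: 0.8, 2: 0.2}, {0: 0.7, 1: 0.3}]
outputs = [
    {0: [0.1, 0.0], 2: [3.0, 4.0]},
    {0: [0.0, 0.1], 2: [3.0, 4.0]},
    {0: [0.1, 0.0], 1: [6.0, 8.0]},
]

by_frequency = ranking(frequency_scores(gates, 3))      # [0, 2, 1]
by_reap = ranking(reap_style_scores(gates, outputs, 3))  # [1, 2, 0]
```

On this toy data the two rankings are exact opposites: frequency would keep the busy but low-impact expert 0 first, while the gate-weighted magnitude score would prune it first.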
Available on Hugging Face as `nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B` in BF16, FP8, and NVFP4 variants. Worth grabbing for H100/H200 reasoning workloads where you currently swap models for cost/latency tradeoffs. Check the model card for license terms before commercial deployment: Nemotron variants have shipped under mixed licenses, and the release notes didn't include an explicit open-source statement.
