NVIDIA NVFP4 paper: 4-bit pretrain hits FP8 parity on 12B at 10T tokens, Zubnet AI News

NVIDIA published "Pretraining Large Language Models with NVFP4" on arXiv (2509.25149v2), describing the methodology behind 4-bit pretraining for Nemotron-Nano-12B-v2-Base: a 62-block hybrid (6 attention + 28 FFN + 28 Mamba-2) with hidden size 5120, trained for 10 trillion tokens. Downstream accuracy lands within noise of the FP8 baseline. Quoting NVFP4 first and FP8 second: MMLU-Pro 62.58% vs 62.62%, GSM8K-CoT 92.27% vs 89.08% (NVFP4 actually higher). Validation loss stays within 1% of FP8 during the stable phase, widening to ~1.5% during decay. Hardware target: NVIDIA Blackwell Tensor Cores. Throughput: roughly 2× over FP8 on GB200, 3× on GB300. Operand memory footprint approximately halved. This is the longest publicly documented 4-bit pretraining run.
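The "approximately halved" figure can be sanity-checked with a back-of-envelope calculation. The sketch below assumes the format layout reported in the paper (4-bit elements, one 8-bit scale per 16-element block), ignores the single FP32 per-tensor scale, and treats the FP8 baseline as a flat 8 bits per element; the variable names are illustrative only.

```python
# Back-of-envelope storage cost per element.
# Assumption: the FP32 per-tensor scale is ignored (one value per tensor).
ELEMENT_BITS = 4    # one E2M1 element
SCALE_BITS = 8      # one E4M3 scale factor shared by a block
BLOCK_SIZE = 16     # elements per block

# Each element carries its own 4 bits plus 1/16 of the shared block scale.
nvfp4_bits = ELEMENT_BITS + SCALE_BITS / BLOCK_SIZE

# Assumption: FP8 baseline counted as 8 bits per element, no scale overhead.
fp8_bits = 8.0

ratio = nvfp4_bits / fp8_bits
print(f"NVFP4: {nvfp4_bits} bits/element, {ratio:.4f} of FP8")
# prints "NVFP4: 4.5 bits/element, 0.5625 of FP8"
```

Under these assumptions the footprint is a little over half of FP8 rather than exactly half, which is consistent with "approximately halved."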

The four stabilization techniques are the actual deliverable, and the paper's ablations report all four as necessary. First, selective high precision: about 16% of linear layers stay in BF16, concentrated in the first 2 and final 8 of 62 blocks. Second, 16×16 Random Hadamard Transforms with random ±1 sign vectors, applied only to Wgrad inputs. Third, 2D block scaling for weights so the forward and backward pass see the same quantized representation. Fourth, stochastic rounding on gradients only; the paper notes it is "detrimental on forward-pass tensors." The format itself is E2M1 elements in 16-element blocks with E4M3 scale factors plus an FP32 per-tensor scale overlay, ensuring at least 6.25% of values (one in 16, the block maximum) sit at near-FP8 precision.
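To make the block format and the rounding distinction concrete, here is a minimal pure-Python sketch. It is not NVIDIA's implementation. It assumes the standard E2M1 magnitude grid, keeps the block scale as an ordinary float instead of encoding it in E4M3, and omits the FP32 per-tensor scale and the 2D weight blocks. The function names (`round_nearest`, `round_stochastic`, `quantize_block`) are hypothetical.

```python
import random

# Magnitudes an E2M1 element can represent (the sign is a separate bit).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = E2M1_GRID[-1]
BLOCK_SIZE = 16


def round_nearest(mag):
    """Deterministic rounding to the closest grid point.
    Ties go to the smaller magnitude here, a simplification."""
    return min(E2M1_GRID, key=lambda g: abs(g - mag))


def round_stochastic(mag, rng):
    """Pick one of the two neighbouring grid points with probability
    proportional to proximity, so the expected result equals mag."""
    mag = min(mag, E2M1_MAX)
    for lo, hi in zip(E2M1_GRID, E2M1_GRID[1:]):
        if lo <= mag <= hi:
            p_up = (mag - lo) / (hi - lo)
            return hi if rng.random() < p_up else lo
    return E2M1_MAX  # not reached for magnitudes in [0, E2M1_MAX]


def quantize_block(block, rounder):
    """Quantize one block and return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Assumption: the scale stays a Python float. The real format stores
    # it in E4M3 and layers an FP32 per-tensor scale on top.
    scale = amax / E2M1_MAX
    dequant = []
    for x in block:
        q = rounder(abs(x) / scale)
        dequant.append((q if x >= 0 else -q) * scale)
    return scale, dequant


rng = random.Random(0)
block = [rng.gauss(0.0, 1.0) for _ in range(BLOCK_SIZE)]

scale, rtn = quantize_block(block, round_nearest)
_, sr = quantize_block(block, lambda m: round_stochastic(m, rng))

# Unbiasedness check: 2.4 sits between grid points 2.0 and 3.0.
x = 2.4
draws = [round_stochastic(x, rng) for _ in range(20000)]
mean_sr = sum(draws) / len(draws)

print("round-to-nearest of 2.4:", round_nearest(x))
print("mean of stochastic draws:", round(mean_sr, 3))
```

Round-to-nearest always sends 2.4 to 2.0, a systematic error, while the stochastic draws average out close to 2.4. That is the bias argument for using stochastic rounding on gradients. The block maximum maps to the top grid point and is recovered through the scale, which is the "one value in 16" that keeps higher precision.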

Position this against MXFP4, the prior 4-bit microscaling format. On an 8B model trained for 1T tokens, NVFP4 has a 1.5% loss gap to BF16; MXFP4 has 2.5%. To match NVFP4's loss, MXFP4 needs 1.36T tokens, 36% more. That's a measurable wall-clock advantage that flows through to total cost of ownership. Two things to track. First, the speedup is tied to Blackwell: pre-Blackwell hardware won't see the 2-3× gain, though the algorithmic techniques are extractable. Second, the paper itself flags pending work: not all linear layers are quantized (the ~16% BF16 holdout), attention and communication paths aren't yet 4-bit, and scaling laws for FP4 across parameter counts and token horizons remain open.

Monday: if you're pretraining on Blackwell-class hardware (GB200/GB300) at any non-trivial scale, the NVFP4 methodology is reproducible from the paper plus NVIDIA Transformer Engine support. Implementation gating: adopt the four stabilization techniques together, not individually. Skipping stochastic rounding gives biased gradients; skipping Random Hadamard breaks Wgrad statistics; skipping 2D weight scaling breaks fwd/bwd consistency. The Nemotron-Nano-v2 architecture (Mamba + FFN + Attention hybrid) is independent of the NVFP4 method, so the recipe should transfer to dense transformer pretraining too, though validation runs aren't reported for that case. If you're not on Blackwell, treat this as a forward-looking reference for when you upgrade.
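For readers wondering why a Random Hadamard Transform can be inserted without changing the result, the property doing the work is orthogonality: transforming both GEMM operands along the contraction dimension leaves their product intact while spreading outliers across the block. The sketch below illustrates that on toy 16-element vectors in pure Python. The helper names (`hadamard`, `random_hadamard`, `apply_rht`) and the exact placement of the sign vector are assumptions for illustration, not the paper's kernel.

```python
import math
import random

N = 16  # transform size reported in the paper


def hadamard(n):
    """Sylvester construction of an n x n Hadamard matrix (n a power of two)."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def random_hadamard(n, rng):
    """diag(signs) @ H / sqrt(n): an orthogonal matrix with random row signs."""
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    norm = 1.0 / math.sqrt(n)
    return [[signs[i] * v * norm for v in row]
            for i, row in enumerate(hadamard(n))]


def apply_rht(vec, mat):
    """Row vector times matrix."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


rng = random.Random(0)
R = random_hadamard(N, rng)

# Two toy operands standing in for slices along the contraction dimension.
a = [rng.gauss(0.0, 1.0) for _ in range(N)]
b = [rng.gauss(0.0, 1.0) for _ in range(N)]

# Orthogonality: the inner product survives the transform (up to float error).
before = dot(a, b)
after = dot(apply_rht(a, R), apply_rht(b, R))
print("inner product before/after:", round(before, 6), round(after, 6))

# Outlier spreading: a single spike of 8.0 becomes 16 entries of magnitude 2.0.
spike = [0.0] * N
spike[3] = 8.0
spread = apply_rht(spike, R)
print("max |value| before/after:", max(abs(v) for v in spike),
      max(abs(v) for v in spread))
```

A flatter block is friendlier to a 4-bit grid, since one large outlier no longer dictates the scale for its 15 neighbours.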
