NVIDIA Gated DeltaNet-2 decouples erase and write — NIAH 63 to 90 at 2K

NVIDIA dropped Gated DeltaNet-2 this week: a linear-attention layer that splits DeltaNet's single scalar β into two channel-wise gates, one for erase (key axis) and one for write (value axis). The update equation is the whole point: `S_t = (I − k_t (b_t ⊙ k_t)ᵀ) D_t S_{t−1} + k_t (w_t ⊙ v_t)ᵀ`. The erase gate `b_t ∈ [0,1]^{d_k}` controls which key channels of the decayed state get read back and removed; the write gate `w_t ∈ [0,1]^{d_v}` controls which value channels of the new content get committed; `D_t = Diag(α_t)` is the channel-wise decay inherited from KDA. When both gates collapse to one shared scalar β you recover KDA; make the decay scalar as well and you recover Gated DeltaNet; drop the decay entirely and you recover the original DeltaNet. For anyone tracking the linear-attention / SSM line (Mamba-2, KDA, RWKV-v7, GDN-1), this is the cleanest architectural delta in the family since gated decay.
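To make the update concrete, here is a minimal NumPy sketch of a single recurrent step, written directly from the equation above. It is not the repo's chunkwise Triton implementation; the function names, the `(d_k, d_v)` state layout, and the unit-norm key are assumptions made for illustration. The second function builds the transition matrix explicitly so the two forms can be compared.

```python
import numpy as np

def gdn2_step(S, k, v, b, w, alpha):
    """One step of S_t = (I - k (b*k)^T) Diag(alpha) S_{t-1} + k (w*v)^T.

    Assumed shapes: S (d_k, d_v); k, b, alpha (d_k,); v, w (d_v,).
    """
    decayed = alpha[:, None] * S      # D_t S_{t-1}: decay each key channel
    read = (b * k) @ decayed          # (d_v,): what the erase-gated key decodes
    return decayed - np.outer(k, read) + np.outer(k, w * v)

def gdn2_step_dense(S, k, v, b, w, alpha):
    """Same update with the (d_k, d_k) transition matrix written out."""
    d_k = k.shape[0]
    transition = (np.eye(d_k) - np.outer(k, b * k)) @ np.diag(alpha)
    return transition @ S + np.outer(k, w * v)

# Illustrative inputs (random, not from the paper).
rng = np.random.default_rng(0)
d_k, d_v = 5, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)                # assumption: keys are L2-normalised
v = rng.normal(size=d_v)
b = rng.uniform(size=d_k)             # erase gate in [0, 1]
w = rng.uniform(size=d_v)             # write gate in [0, 1]
alpha = rng.uniform(size=d_k)         # channel-wise decay in [0, 1]

S_next = gdn2_step(S, k, v, b, w, alpha)
```

Setting `b` and `w` to the same constant β in this sketch gives `(I − β k kᵀ) Diag(α) S + β k vᵀ`, which is the scalar-gate form the paragraph describes.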

Numbers at 1.3B parameters, 100B FineWeb-Edu tokens, 4K context. Pure recurrent: language modeling + reasoning average **53.11** vs Mamba-3 MIMO **52.39** vs KDA **52.28**. S-NIAH-3 @2K context jumps to **89.8** from KDA's 63.2, a 26.6-point absolute gain on the canonical needle-in-a-haystack benchmark, plausibly attributable to the channel-wise erase/write split letting the state retain key-axis information without scalar-coupled value loss. MK-NIAH-1 @4K: **37.8** vs KDA's 28.0. Real-world retrieval average **29.88**. Hybrid (GDN-2 + 2K sliding-window attention every few layers) pushes language+reasoning to **53.97** and real-world retrieval to **42.28**, confirming the still-standard finding that mixing linear and softmax layers buys you the retrieval ceiling while keeping the linear throughput floor. Chunkwise training at chunk-size 64 with fused Triton kernels; WY backward restricted to 2-4 warps on Hopper to dodge layout assertions.

Ecosystem read: the linear-attention community has converged on "gate the decay" as the single biggest win since pure DeltaNet: KDA introduced channel-wise α, Mamba-2 has its SSD framework, RWKV-v7 has its time-mix. GDN-2's contribution is recognising that *one scalar β doing both erase and write* was the next coupling worth breaking. Once you decouple, the model can hold a key-pattern stable across many tokens (don't erase it from the key axis) while still updating the associated value (write through). That's exactly the failure mode needle-in-haystack benchmarks expose, and the 63→90 jump on S-NIAH-3 is the empirical confirmation. The 4K training-length caveat is real: long-context claims are RULER-retrieval based, not continuous generation past training length, and no throughput numbers vs baselines are published. Builders should test long-context behaviour and measure throughput themselves before committing.
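A toy example shows what breaking that coupling buys. This is a sketch under stated assumptions, not an experiment from the paper: the state already stores one association `k → v_old`, decay is switched off, the gates are set to hard 0/1 values, and `gdn2_step` is an illustrative helper written from the update equation.

```python
import numpy as np

def gdn2_step(S, k, v, b, w, alpha):
    """One step of S_t = (I - k (b*k)^T) Diag(alpha) S_{t-1} + k (w*v)^T."""
    decayed = alpha[:, None] * S
    read = (b * k) @ decayed
    return decayed - np.outer(k, read) + np.outer(k, w * v)

d_k, d_v = 4, 3
k = np.full(d_k, 0.5)                  # unit-norm key: 4 * 0.5**2 == 1
v_old = np.array([1.0, 2.0, 3.0])
v_new = np.array([4.0, 5.0, 6.0])
S0 = np.outer(k, v_old)                # state already holds k -> v_old
no_decay = np.ones(d_k)                # alpha = 1 isolates the gate behaviour

ones_k, zeros_k = np.ones(d_k), np.zeros(d_k)
ones_v, zeros_v = np.ones(d_v), np.zeros(d_v)

# Erase and write both fully on: the old value is replaced by the new one.
replaced = k @ gdn2_step(S0, k, v_new, ones_k, ones_v, no_decay)

# Erase off, write on: the new value is added on top of the old one.
accumulated = k @ gdn2_step(S0, k, v_new, zeros_k, ones_v, no_decay)

# Erase on, write off: the association is removed and nothing is committed.
erased = k @ gdn2_step(S0, k, v_new, ones_k, zeros_v, no_decay)

print(replaced, accumulated, erased)
```

With a single scalar β only the first behaviour (and its softened versions) is reachable, because erase strength and write strength move together; the second and third cases need the two gates to differ.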

Monday morning: code is up at github.com/NVlabs/GatedDeltaNet-2 (PyTorch + Triton kernels, full pretrain.py, AdamW peak LR 4e-4, 1B-token warmup). License is NVIDIA Source Code License-NC: non-commercial, no redistribution, no shipping a product with this. If you're researching architectures, fine-tuning your own SSM, or running ablations on the linear-vs-softmax frontier, clone and benchmark. If you're shipping a production model and were hoping to swap layers, the NC license blocks you. The architectural idea is reproducible from the paper and the gating equation is two sigmoids, so an independent reimplementation is the most likely community path.
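For a sense of how small the "two sigmoids" part is, here is a hedged sketch of computing the erase and write gates from a token's hidden state. The parametrisation is an assumption: one bias-free linear projection per gate, with hypothetical weights `W_b` and `W_w`. The released code may use a different projection (low-rank, with bias, per-head), so treat this as an illustration of the shapes only.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def make_gates(x, W_b, W_w):
    """Map a hidden state x (d_model,) to an erase gate and a write gate.

    W_b: (d_k, d_model) and W_w: (d_v, d_model) are hypothetical weights.
    """
    b = sigmoid(W_b @ x)    # erase gate, one entry per key channel
    w = sigmoid(W_w @ x)    # write gate, one entry per value channel
    return b, w

# Illustrative sizes and random weights, not the paper's configuration.
rng = np.random.default_rng(0)
d_model, d_k, d_v = 16, 4, 6
x = rng.normal(size=d_model)
W_b = rng.normal(size=(d_k, d_model)) * 0.1
W_w = rng.normal(size=(d_v, d_model)) * 0.1

b_gate, w_gate = make_gates(x, W_b, W_w)
```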
