Nous TST cuts pretraining wall-clock 2.5× via token-bag phase then recovery, Zubnet AI News

Nous Research released Token Superposition Training (TST) this week — a two-phase pretraining method that cuts wall-clock training time by up to 2.5× at matched FLOPs without changing model architecture, optimizer, tokenizer, parallelism, or training data. The headline result is at the 10B-A1B mixture-of-experts scale: TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200-GPU-hours versus the baseline's 12,311. The technique is validated at four scales — 270M and 600M dense (SmolLM2 shapes adapted to Llama3 modeling code), 3B dense (SmolLM3 shape), and a 10B-A1B MoE in the Qwen3 family — using DCLM for smaller runs and a 50/50 DCLM/FineWeb-Edu mix for the MoE run. All runs use AdamW with Warmup-Stable-Decay LR scheduling under TorchTitan + FSDP on 8 or 64 NVIDIA B200 GPUs. The final model is architecturally identical to one produced by conventional pretraining; inference behavior is unchanged.

The mechanism splits cleanly into two phases. Phase 1 (the superposition phase, run for a fraction r ∈ [0.2, 0.4] of total training steps) segments the input sequence of length L into non-overlapping bags of s contiguous tokens, then collapses each bag into a single latent "s-token" by averaging the s embeddings. The transformer then processes a sequence of length L/s. To keep each TST step equal in FLOPs to a standard training step, the data sequence length is increased by s× during the superposition phase, so the model ingests s× more text per unit of compute, which is the source of the throughput gain. On the output side, each latent position predicts the next bag of s tokens, with a multi-hot cross-entropy loss assigning 1/s probability mass to each target. This is implementable with existing fused CE kernels; no new kernels or auxiliary heads are required. Phase 2 (recovery) resumes from the saved checkpoint with standard next-token prediction for the remaining 1 − r fraction of steps. A transient loss spike of 1 to 2 nats appears at the transition and resolves within a few thousand steps; from there the recovered model crosses below the equal-FLOPs baseline and stays there.
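The two Phase 1 operations described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not Nous's implementation: the function names `bag_embeddings` and `multi_hot_cross_entropy` are hypothetical, embeddings are plain lists of floats, and the loss is written for a single latent position. Because the target puts 1/s mass on each token in the bag, the loss equals the mean of s ordinary cross-entropy terms computed against the same logits, which is one plausible way an existing hard-label CE kernel could be reused.

```python
import math


def bag_embeddings(embeddings, s):
    """Average each non-overlapping bag of s token embeddings into one vector.

    embeddings: list of L vectors (lists of floats); L must be divisible by s.
    Returns a list of L/s averaged vectors (the latent "s-tokens").
    """
    if len(embeddings) % s != 0:
        raise ValueError("sequence length must be divisible by s")
    bags = []
    for start in range(0, len(embeddings), s):
        bag = embeddings[start:start + s]
        dim = len(bag[0])
        bags.append([sum(vec[d] for vec in bag) / s for d in range(dim)])
    return bags


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def multi_hot_cross_entropy(logits, target_bag):
    """Cross-entropy against a target with 1/s mass on each token id in the bag.

    Equivalent to averaging s standard CE losses over the same logits.
    With s = 1 this reduces to ordinary next-token cross-entropy.
    """
    s = len(target_bag)
    logp = log_softmax(logits)
    return -sum(logp[t] for t in target_bag) / s


# Toy usage: 4 tokens with 2-dim embeddings, bag size s = 2.
toy_embeddings = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [2.0, 4.0]]
latents = bag_embeddings(toy_embeddings, s=2)

# Uniform logits over a 4-token vocabulary; the next bag holds token ids 1 and 2.
loss = multi_hot_cross_entropy([0.0, 0.0, 0.0, 0.0], [1, 2])
```

In a real training loop the averaging would be a reshape and mean over a batched tensor, and the loss would be computed for all L/s positions at once; the toy version only pins down the arithmetic.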

The honest hedging in Nous's paper is the part that matters most. The team explicitly presents three comparison views: equal-FLOPs (TST wins), equal-loss (TST wins), and equal-data (baseline wins, because TST's effective compute per data token is smaller). This is the boundary condition that determines where TST applies: compute-bound pretraining benefits, data-bound pretraining does not. Given recent industry discussion about data scarcity, more shops than expected may discover they are data-bound in practice. The ablation results are also load-bearing. When input embeddings and LM head are randomly re-initialized at the Phase 2 boundary, final loss rises to 2.938, worse than both TST at 2.676 and the standard baseline at 2.808. Phase 1 representations are not throwaway; representations shared across the two phases are what make TST work. The input-side (token averaging) and output-side (next-bag prediction) mechanisms each outperform the baseline on their own and combine without interference, suggesting two orthogonal mechanisms rather than one trick. Concrete benchmarks at the 10B-A1B MoE scale (TST vs baseline): HellaSwag 71.2 vs 70.1, ARC-Easy 74.2 vs 73.8, ARC-Challenge 47.3 vs 46.3, MMLU 39.0 vs 37.4.

For builders pretraining anything from a small SLM to a frontier-class MoE, the practical question becomes whether your workload is compute-bound (TST helps materially) or data-bound (TST hurts you because it consumes more data tokens per FLOP). Nous's reference setup, with r between 0.2 and 0.4 and s between 6 (at 3B) and 16 (at 10B-A1B), is the starting parameterization to ablate against. The technique sits in the same broader class as multi-token prediction (MTP), but is the least expensive member of that class: a single output head, target replacement only, existing CE kernels. Unlike MTP, TST shows consistent gains across all tested scales, including small models where MTP has been shown to degrade performance. The paper is at arXiv 2605.06546 and the implementation should be released via Nous's standard channels (the same publishing pattern as Hermes Agent earlier this week). For shops with active pretraining roadmaps, this is the rare engineering contribution that's worth ablating in your own pipeline within the month.
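To judge whether a run is data-bound, it helps to estimate how much more text TST consumes than a baseline with the same step count. The sketch below is a back-of-the-envelope calculation, assuming the number of sequences per step is unchanged and that Phase 1 data sequences are exactly s× longer, as described earlier. The helper name `tst_data_budget` and the example values for `total_steps` and `base_seq_len` are illustrative, not taken from the paper; r = 0.3 falls inside the reported range and s = 16 is the reported 10B-A1B setting.

```python
def tst_data_budget(total_steps, base_seq_len, r, s):
    """Estimate per-sequence token consumption for a TST run vs a baseline.

    Assumes an unchanged number of sequences per step, Phase 1 covering a
    fraction r of steps with data sequences s times longer than the baseline,
    and Phase 2 using the baseline sequence length.
    """
    phase1_steps = round(total_steps * r)
    phase2_steps = total_steps - phase1_steps
    tst_tokens = phase1_steps * base_seq_len * s + phase2_steps * base_seq_len
    baseline_tokens = total_steps * base_seq_len
    return {
        "phase1_steps": phase1_steps,
        "phase2_steps": phase2_steps,
        "phase1_data_seq_len": base_seq_len * s,
        # Closed form of the ratio: r * s + (1 - r)
        "data_multiplier": tst_tokens / baseline_tokens,
    }


# Illustrative numbers only: 1,000 steps, 4,096-token baseline sequences.
budget = tst_data_budget(total_steps=1000, base_seq_len=4096, r=0.3, s=16)
print(budget["data_multiplier"])  # 5.5
```

Under these assumptions a run with r = 0.3 and s = 16 reads about 5.5× the text of an equal-step baseline, so a corpus that comfortably covers a conventional run may not cover the TST version without repeating data.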
