Nous TST: token-bag phase + recovery cuts pretraining wall-clock by 2.5×, Zubnet AI News

Nous Research this week released Token Superposition Training (TST), a two-phase pretraining method that cuts wall-clock training time by up to 2.5× at matched FLOPs without changing the model architecture, optimizer, tokenizer, parallelism strategy, or training data. The headline result is at the 10B-A1B mixture-of-experts scale: TST reaches a lower final training loss than a matched-FLOPs baseline while consuming 4,768 B200 GPU-hours against the baseline's 12,311. The technique is validated at four scales: 270M and 600M dense (SmolLM2 shapes adapted onto Llama3 modeling code), 3B dense (SmolLM3 shape), and a 10B-A1B MoE from the Qwen3 family. The smaller runs train on DCLM, and the MoE run on a 50/50 DCLM/FineWeb-Edu mix. All runs use AdamW with Warmup-Stable-Decay LR scheduling on TorchTitan + FSDP, across 8 or 64 NVIDIA B200 GPUs. The final model is architecturally identical to one produced by conventional pretraining, and inference behavior is unchanged.
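As a quick sanity check on the headline figure, the two GPU-hour numbers quoted above can be turned into a speedup ratio and a savings fraction. This is only arithmetic on the reported values; the variable names are illustrative and not taken from the paper.

```python
# GPU-hour figures as reported for the 10B-A1B MoE run.
tst_gpu_hours = 4_768
baseline_gpu_hours = 12_311

# Speedup: how many times more GPU-hours the baseline consumed.
speedup = baseline_gpu_hours / tst_gpu_hours

# Absolute and relative savings.
hours_saved = baseline_gpu_hours - tst_gpu_hours
fraction_saved = hours_saved / baseline_gpu_hours

print(f"speedup: {speedup:.2f}x")
print(f"GPU-hours saved: {hours_saved} ({fraction_saved:.1%} of baseline)")
```

The ratio comes out slightly above 2.5, which is consistent with the "up to 2.5×" framing in the headline.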

The mechanism splits cleanly into two phases. Phase 1 (the superposition phase, which runs for a fraction r ∈ [0.2, 0.4] of total training steps) segments an input sequence of length L into non-overlapping bags of s contiguous tokens, then collapses each bag into a single latent "s-token" by averaging its s embeddings. The transformer then processes a sequence of length L/s. To keep each TST step at the same FLOPs as a standard training step, the data sequence length is increased by a factor of s during the superposition phase, so the model ingests s times more text per unit of compute. That is the source of the throughput gain. On the output side, each latent position predicts the bag of the next s tokens, using a multi-hot cross-entropy loss that assigns probability mass 1/s to each target. This can be implemented with existing fused CE kernels and needs no new kernel or auxiliary head. Phase 2 (recovery) resumes from the saved checkpoint with standard next-token prediction for the remaining 1 − r fraction of steps. The transition produces a transient loss spike of 1 to 2 nats that resolves within a few thousand steps; after that, the recovered model drops below the equal-FLOPs baseline and stays there.
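The two Phase 1 operations described above, averaging bags of embeddings on the input side and a multi-hot cross-entropy on the output side, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not Nous's implementation: the function names `bag_embeddings` and `multi_hot_ce` are hypothetical, embeddings are plain lists of floats, and a real training loop would use batched tensors and a fused CE kernel.

```python
import math

def bag_embeddings(token_embs, s):
    """Average each run of s consecutive token embeddings into one latent vector."""
    assert len(token_embs) % s == 0, "sequence length must be divisible by s"
    dim = len(token_embs[0])
    bags = []
    for start in range(0, len(token_embs), s):
        bag = token_embs[start:start + s]
        bags.append([sum(vec[d] for vec in bag) / s for d in range(dim)])
    return bags

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def multi_hot_ce(logits, target_ids):
    """Cross-entropy against a target with 1/s mass on each token of the next bag."""
    logp = log_softmax(logits)
    s = len(target_ids)
    return -sum(logp[t] for t in target_ids) / s

# Four tokens with 2-d embeddings, bagged with s=2, give two latent positions.
embs = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
latents = bag_embeddings(embs, s=2)

# One latent position predicting a bag of two target token ids over a 4-token vocabulary.
loss = multi_hot_ce([0.0, 0.0, 0.0, 0.0], target_ids=[1, 3])
```

With s = 1 the bag is a single token and `multi_hot_ce` reduces to ordinary next-token cross-entropy, which is why the same loss code path can plausibly serve both phases.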

The honest hedge in Nous's paper is the part that matters most. The team explicitly presents three comparison views: equal-FLOPs (TST wins), equal-loss (TST wins), and equal-data (the baseline wins, because TST spends less effective compute per data token). This is the boundary condition that determines where TST applies: compute-bound pretraining benefits, data-bound pretraining does not. Given the recent industry discussion of data scarcity, more shops than expected may turn out to be data-bound in practice. The ablation results are also load-bearing. An ablation that randomly re-initializes the input embeddings and LM head at the Phase 2 boundary pushes final loss up to 2.938, worse than both TST's 2.676 and the standard baseline's 2.808. Phase 1 representations are not throwaway; the representations shared across the two phases are what make TST work. The input-side mechanism (token averaging) and the output-side mechanism (next-bag prediction) each beat the baseline independently and combine without interference, which suggests they are two orthogonal mechanisms rather than a single trick. Concrete benchmarks at the 10B-A1B MoE scale: HellaSwag 71.2 vs 70.1 for the baseline, ARC-Easy 74.2 vs 73.8, ARC-Challenge 47.3 vs 46.3, MMLU 39.0 vs 37.4.

For builders pretraining anything from small SLMs to frontier-class MoEs, the practical question becomes whether your workload is compute-bound (TST helps materially) or data-bound (TST hurts, because it consumes more data tokens per FLOP). Nous's reference setup, with r between 0.2 and 0.4 and s between 6 (at 3B) and 16 (at 10B-A1B), is the starting parameterization to ablate against. The technique belongs to the same broad class as multi-token prediction (MTP), but it is the cheapest member of that class: a single output head, target replacement only, and existing CE kernels. Unlike MTP, TST shows consistent gains at every tested scale, including the small models where MTP has been shown to degrade performance. The paper is at arXiv 2605.06546, and the implementation should be released through Nous's standard channels (the same publishing pattern as Hermes Agent earlier this week). For shops with active pretraining roadmaps, this is a rare engineering contribution worth ablating in your own pipeline within a month.
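To make the compute-bound versus data-bound question concrete, the sketch below estimates how much more text a TST run reads than a baseline with the same number of steps. It assumes, as the description of Phase 1 states, that each superposition step ingests s times the tokens of a standard step, and that Phase 2 steps match the baseline. The helper names `tst_token_multiplier` and `phase_for_step` are hypothetical and not from the paper.

```python
def tst_token_multiplier(r, s):
    """Tokens read by a TST run relative to a baseline with the same step count.

    A fraction r of steps reads s times the usual tokens; the rest read 1x.
    """
    assert 0.0 <= r <= 1.0 and s >= 1
    return r * s + (1.0 - r)

def phase_for_step(step, total_steps, r):
    """Return 'superposition' for the first r fraction of steps, else 'recovery'."""
    boundary = round(r * total_steps)
    return "superposition" if step < boundary else "recovery"

# Example: r = 0.25 and s = 8 means the run reads 2.75x the baseline's tokens.
multiplier = tst_token_multiplier(r=0.25, s=8)

# Example schedule lookup for a 1,000-step run with r = 0.25.
first_phase = phase_for_step(0, total_steps=1000, r=0.25)
later_phase = phase_for_step(250, total_steps=1000, r=0.25)
```

At the upper end of the reference range (r = 0.4, s = 16) the same formula gives roughly 7 times the baseline's token consumption, so a team should check that its corpus can supply that much fresh data before adopting the method.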

