Nous CNA: steering refusal with 0.1% of MLP neurons, no SAE training required

Nous Research released a paper and code this week showing that refusal behaviour in instruction-tuned LLMs lives in roughly 0.1% of MLP activations, and that you can locate that circuit using nothing but contrastive forward passes. No SAE training, no gradient computation, no weight modification. The method, Contrastive Neuron Attribution (CNA), takes pairs of harmful and benign prompts, diffs the per-neuron activations in the MLP, ranks neurons by separation, filters out "universal" neurons (those active on 80%+ of prompts), and applies a scalar multiplier at inference time to ablate or amplify the surviving set. For builders working on safety, evals, or behaviour steering, this is the cheapest steering primitive published so far.
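The selection steps described above can be sketched on toy data. This is a minimal illustration, not the code from the Nous repository: the function names `select_neurons` and `steer`, the use of a difference of means as the separation score, and the `active_eps` threshold for calling a neuron "active" are all assumptions made for the example.

```python
from statistics import mean

def select_neurons(harmful, benign, top_frac=0.001,
                   universal_thresh=0.8, active_eps=0.0):
    """Pick candidate neurons from per-prompt activation vectors.

    harmful, benign: lists of equal-length activation vectors, one per prompt.
    Separation score (assumed here): |mean(harmful) - mean(benign)| per neuron.
    """
    n_neurons = len(harmful[0])
    all_prompts = harmful + benign
    scored = []
    for j in range(n_neurons):
        # Fraction of prompts on which neuron j counts as active (assumed rule).
        active = sum(1 for row in all_prompts if abs(row[j]) > active_eps)
        if active / len(all_prompts) >= universal_thresh:
            continue  # "universal" neuron: fires almost everywhere, so drop it
        sep = abs(mean(r[j] for r in harmful) - mean(r[j] for r in benign))
        scored.append((sep, j))
    scored.sort(reverse=True)  # largest separation first
    k = max(1, round(top_frac * n_neurons))
    return sorted(j for _, j in scored[:k])

def steer(activations, neuron_ids, multiplier):
    """Scale the selected neurons; 0.0 ablates, values above 1.0 amplify."""
    out = list(activations)
    for j in neuron_ids:
        out[j] *= multiplier
    return out

# Toy data: 4 neurons, 2 harmful and 2 benign prompts.
harmful = [[0.0, 2.0, 1.0, 0.0], [0.0, 2.2, 1.0, 0.1]]
benign = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
picked = select_neurons(harmful, benign, top_frac=0.25, active_eps=0.05)
ablated = steer([1.0, 2.0, 3.0, 4.0], picked, 0.0)
```

In a real model the activation vectors would come from forward hooks on the MLP layers, and the multiplier would be applied inside the same hooks at inference time. Here neuron 2 is dropped as universal because it is active on every prompt, and neuron 1 is the only one selected.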

The numbers come from a 16-model sweep: Llama 3.1/3.2 and Qwen 2.5, 1B to 72B, base and instruct. On JBB-Behaviors (100 harmful prompts), after ablating the top 0.1% of neurons: Qwen 2.5-7B-Instruct dropped from 87% refusal to 2% (−97.7% relative); Llama-3.1-70B-Instruct from 86% to 18% (−79.1%); Llama-3.2-3B-Instruct from 84% to 47% (−44%). Output quality stayed above 0.97 at all steering strengths, while the Contrastive Activation Addition baseline fell below 0.60 on six of eight instruct models. MMLU stayed within one point of baseline, so steering is not breaking general capability. The paper is at arXiv 2605.12290; the code is at github.com/NousResearch/neural-steering.
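The percentages in parentheses are relative changes in refusal rate, not percentage-point differences. A short check, assuming the usual definition (after − before) / before, reproduces them from the before and after figures quoted above; the helper name `relative_change` is introduced only for this example.

```python
def relative_change(before, after):
    """Relative change in percent, e.g. 87 -> 2 is about -97.7."""
    return (after - before) / before * 100.0

# Refusal rates (%) before and after ablation, as quoted in the article.
results = {
    "Qwen 2.5-7B-Instruct": (87, 2),
    "Llama-3.1-70B-Instruct": (86, 18),
    "Llama-3.2-3B-Instruct": (84, 47),
}

changes = {name: round(relative_change(b, a), 1)
           for name, (b, a) in results.items()}

for name, pct in changes.items():
    print(f"{name}: {pct:.1f}%")
```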

What this changes for the ecosystem: SAE-based circuit steering (the Anthropic / Goodfire line) requires training a sparse autoencoder for every layer of the model at significant compute cost, and then handling activation noise. CNA gets from forward passes and a contrastive prompt set to a usable steering vector. That cuts the cost of interpretability-driven behaviour control by orders of magnitude, which makes it cheap enough to integrate into red-team pipelines, post-training safety audits, and per-deployment behaviour tuning. The other side deserves honesty too: a method that locates the refusal circuit in 0.1% of neurons is also a method for removing it. Nous states openly that ablation reduces refusal rates on instruct models by up to 80-98%. Defensive use (auditing what your model considers harmful) and offensive use (stripping refusals) are the same operation, with only the direction of the multiplier flipped.

The tradeoffs that bound the result: CNA was tested only on gated-SiLU MLPs with grouped-query attention; MoE models such as Mixtral and DeepSeek-V3 and newer mixture architectures are unvalidated. Base (non-instruct) models show no behavioural change under ablation, which confirms that the refusal circuit emerges during instruction tuning. Quality depends on contrastive pair curation: bad pairs yield noisy circuits. Amplification factors above 1 trigger repetition collapse. Monday morning: if you are shipping anything on top of open Llama or Qwen instruct models, clone github.com/NousResearch/neural-steering and run the JBB sweep yourself before someone else runs it against your endpoint. The interpretability primitive is now public; the question is whether your safety posture assumed it would stay expensive.

