Nous CNA steers refusals in 0.1% of MLP neurons — no SAE training needed

Nous Research dropped a paper plus code this week showing that the refusal behaviour in instruction-tuned LLMs lives in roughly 0.1% of MLP neurons, and that you can locate that circuit with nothing more than contrastive forward passes. No SAE training, no gradient computation, no weight modification. The method, Contrastive Neuron Attribution (CNA), takes paired harmful/benign prompts, diffs the per-neuron MLP activations between the two sets, ranks neurons by separation, filters out "universal" neurons (those active on 80%+ of prompts), and applies a scalar multiplier at inference to ablate or amplify the surviving set. For builders working on safety, eval, or behaviour steering, this is among the cheapest steering primitives yet published.
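
The pipeline described above can be sketched on toy data. This is a minimal illustration in NumPy, not the code from the Nous repository: the function names, the mean-difference separation score, and the `active_thresh` cutoff for deciding when a neuron is "active" are all assumptions made for the example.

```python
import numpy as np

def select_neurons(harmful, benign, top_frac=0.001, universal_frac=0.8, active_thresh=0.5):
    """Rank neurons by harmful/benign separation, skipping 'universal' ones.

    harmful, benign: arrays of shape (n_prompts, n_neurons), one activation
    per prompt per neuron (for example, averaged over tokens).
    active_thresh is a hypothetical cutoff for calling a neuron 'active'.
    """
    # Separation score (assumed): gap between mean activations on the two sets.
    separation = np.abs(harmful.mean(axis=0) - benign.mean(axis=0))
    # A neuron is treated as universal if it is active on >= universal_frac of all prompts.
    all_acts = np.concatenate([harmful, benign], axis=0)
    active_rate = (np.abs(all_acts) > active_thresh).mean(axis=0)
    separation[active_rate >= universal_frac] = -np.inf
    # Keep the top fraction of the remaining neurons.
    k = max(1, round(top_frac * separation.size))
    top = np.argsort(separation)[::-1][:k]
    return top[np.isfinite(separation[top])]

def steer(acts, neuron_idx, multiplier):
    """Scale the selected neurons: 0.0 ablates, values above 1.0 amplify."""
    out = acts.copy()
    out[..., neuron_idx] *= multiplier
    return out

# Toy data: 50 prompts per class, 2000 neurons of low-level noise.
rng = np.random.default_rng(0)
harmful = rng.normal(0.0, 0.1, size=(50, 2000))
benign = rng.normal(0.0, 0.1, size=(50, 2000))
harmful[:, [0, 1]] += 2.0   # neurons 0 and 1 fire only on harmful prompts
harmful[:, 2] += 5.0        # neuron 2 separates the classes too, but it is
benign[:, 2] += 3.0         # active on every prompt, so it counts as universal

selected = select_neurons(harmful, benign)   # 0.1% of 2000 neurons = 2 neurons
ablated = steer(harmful, selected, 0.0)
print(sorted(selected.tolist()))
```

On this toy data the universal filter is what keeps neuron 2 out of the selected set, even though its mean gap is as large as that of neurons 0 and 1.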

The numbers come from a 16-model sweep covering Llama 3.1/3.2 and Qwen 2.5, 1B to 72B, base and instruct. On JBB-Behaviors (100 harmful prompts), ablating the top 0.1% of neurons cut refusal on Qwen 2.5-7B-Instruct from 87% to 2% (−97.7% relative), on Llama-3.1-70B-Instruct from 86% to 18% (−79.1%), and on Llama-3.2-3B-Instruct from 84% to 47% (−44%). Output quality stayed above 0.97 at all steering strengths, whereas the Contrastive Activation Addition baseline dropped below 0.60 on six of eight instruct models. MMLU stayed within one point of baseline, so the steering doesn't tank general capability. Paper at arXiv 2605.12290, code at github.com/NousResearch/neural-steering.
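
The percentages in parentheses are relative reductions in refusal rate, not percentage-point differences. A short check of the arithmetic, using only the before/after figures quoted above (the helper name `relative_drop` is made up for this example):

```python
def relative_drop(before, after):
    """Relative reduction in refusal rate, in percent."""
    return (before - after) / before * 100.0

# (refusal % before ablation, refusal % after), as quoted above.
results = {
    "Qwen 2.5-7B-Instruct": (87, 2),
    "Llama-3.1-70B-Instruct": (86, 18),
    "Llama-3.2-3B-Instruct": (84, 47),
}

drops = {name: round(relative_drop(b, a), 1) for name, (b, a) in results.items()}
for name, drop in drops.items():
    print(f"{name}: -{drop}%")
# Qwen 2.5-7B-Instruct: -97.7%
# Llama-3.1-70B-Instruct: -79.1%
# Llama-3.2-3B-Instruct: -44.0%
```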

What this changes for the ecosystem: SAE-based circuit steering (the Anthropic / Goodfire line) requires training a sparse autoencoder per model layer at significant compute cost, then handling activation noise. CNA gets to a usable set of steering neurons with forward passes and a contrastive prompt set. That collapses the cost of interpretability-driven behaviour control by orders of magnitude, which means it's now cheap enough to integrate into red-team pipelines, post-training safety audits, and per-deployment behaviour tuning. To be honest about the flip side: a method that locates the refusal circuit in 0.1% of neurons is equally a method to remove it. Nous is upfront that ablation drops refusal rates by up to 98% on instruct models (44% to 97.7% across the three results quoted above). Defensive use (auditing what your model considers harmful) and offensive use (stripping refusals) are the same operation with the multiplier sign flipped.

Several tradeoffs bound the result. The method was tested only on gated-SiLU MLPs with grouped-query attention; MoE models like Mixtral, DeepSeek-V3 and the newer mixture architectures are unvalidated. Base (non-instruct) models show no behavioural change under ablation, which is consistent with the refusal circuit emerging during instruction tuning. Quality depends on contrastive pair curation: bad pairs give noisy circuits. Amplification factors above 1 trigger repetition collapse. For Monday morning: if you're shipping anything on top of open Llama or Qwen instruct models, clone github.com/NousResearch/neural-steering and run the JBB sweep yourself before someone else does on your endpoint. The interpretability primitive is now public; the question is whether your safety posture assumed it stayed expensive.
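
For readers unfamiliar with the term, a gated-SiLU MLP (the block used in Llama and Qwen) computes down(silu(gate(x)) * up(x)), and a "neuron" is one column of the intermediate product. The NumPy sketch below shows one place a per-neuron multiplier could be applied in such a block. Where CNA actually hooks in is an assumption here, and the weight names and sizes are hypothetical.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def gated_mlp(x, w_gate, w_up, w_down, neuron_idx=None, multiplier=1.0):
    """Gated-SiLU MLP forward pass with optional per-neuron scaling.

    Scaling the hidden product is an assumed hook point, not a confirmed
    detail of the paper.
    """
    hidden = silu(x @ w_gate) * (x @ w_up)   # (batch, d_ff): one column per neuron
    if neuron_idx is not None:
        hidden = hidden.copy()
        hidden[:, neuron_idx] *= multiplier  # 0.0 ablates, above 1.0 amplifies
    return hidden @ w_down

# Hypothetical tiny block: d_model=8, d_ff=32.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
w_gate = rng.normal(size=(8, 32))
w_up = rng.normal(size=(8, 32))
w_down = rng.normal(size=(32, 8))

baseline = gated_mlp(x, w_gate, w_up, w_down)
ablated = gated_mlp(x, w_gate, w_up, w_down, neuron_idx=[3, 7], multiplier=0.0)

# Ablating neurons removes exactly their contribution to the block output.
hidden = silu(x @ w_gate) * (x @ w_up)
expected = baseline - hidden[:, [3, 7]] @ w_down[[3, 7], :]
print(np.allclose(ablated, expected))
```

Because the down projection is linear, zeroing a neuron subtracts exactly that neuron's contribution and leaves every other neuron's contribution unchanged, which is what makes such a small edit possible without touching the weights.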
