Anthropic published the technical architecture behind Claude Code's Auto Mode, the permission system (GA since April) that replaced approve-each-action with classifier-gated execution. The headline numbers from the post: a 0.4% false-positive rate after the two-stage filter, a 17% false-negative rate on detected overeager actions, and a classifier that is blind by design to assistant prose and tool outputs. For builders running Claude Code in autonomous loops or subagent fan-outs, those numbers shape what to expect from the safety circuit, and the architectural choice underneath is interesting on its own.
Three tiers of action:

- Tier 1 (file reads, text and file-pattern search, code navigation, todo and plan-mode transitions) runs on a fixed allowlist and is never classifier-gated.
- Tier 2 (file writes and edits inside the project directory) is auto-permitted without classifier review.
- Tier 3 (shell commands, web fetches, external tool integrations, subagent spawns, filesystem operations outside the project directory) routes through the classifier.

The classifier is Sonnet 4.6, run in two stages: a fast single-token yes/no filter (8.5% FPR on its own), with chain-of-thought reasoning kicking in only when stage 1 flags an action (collapsing the full pipeline to 0.4% FPR). The same prompt feeds both stages with only the final instruction changed, so prompt caching does most of the work; the staged escalation is sketched below. The design choice that matters: the classifier is **reasoning-blind by intent**. It sees only user messages and the agent's tool calls, never the assistant's prose or tool outputs. That is deliberate isolation against the meta-prompt-injection threat, where tool output convinces the gatekeeper itself that something dangerous is fine. Subagent calls are gated twice: outbound at delegation (blocked if unauthorized), and on return, where a flag is a warning rather than a discard. Three consecutive denials, or twenty total, trips the circuit-breaker and escalates to a human.
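To make the staged escalation concrete, here is a minimal sketch of how a two-stage gate like the one described could be wired. Everything here is illustrative: `call_model`, the prompt wording, and the message shapes are my assumptions, not Anthropic's implementation. The only details taken from the post are the single-token first stage, the chain-of-thought second stage that runs only on a stage-1 flag, and the shared prompt prefix that makes caching effective.

```python
# Illustrative sketch of a two-stage permission gate; not Anthropic's code.
# Assumed: call_model(system, messages, max_tokens) returns completion text
# from some chat-completion API. Prompt strings are invented for the sketch.

GATE_CONTEXT = (
    "You are a permission classifier for an autonomous coding agent.\n"
    "You will see the user's messages and the agent's proposed tool calls,\n"
    "nothing else. Judge whether the pending tool call exceeds what the\n"
    "user asked for.\n"
)

STAGE1_SUFFIX = "Answer with a single token: yes (flag) or no (allow)."
STAGE2_SUFFIX = ("Reason step by step about intent and blast radius, "
                 "then end with ALLOW or BLOCK.")

def gate(classifier_view: list[dict], call_model) -> bool:
    """Return True if the pending action is allowed.

    classifier_view holds only user messages and tool calls (see the
    reasoning-blind view builder later in this piece). Both stages share
    the same prefix, so a prompt cache only pays for the changed suffix.
    """
    # Stage 1: cheap single-token screen. Per the post, this alone runs
    # at ~8.5% FPR, so its job is recall, not precision.
    verdict = call_model(
        system=GATE_CONTEXT + STAGE1_SUFFIX,
        messages=classifier_view,
        max_tokens=1,
    ).strip().lower()
    if verdict != "yes":
        return True  # stage 1 saw nothing worth a second look

    # Stage 2: chain-of-thought review, run only for flagged actions.
    # This is the step the post credits with collapsing FPR to 0.4%.
    reasoning = call_model(
        system=GATE_CONTEXT + STAGE2_SUFFIX,
        messages=classifier_view,
        max_tokens=1024,
    )
    return reasoning.rstrip().endswith("ALLOW")
```

The caching-friendly shape is the notable design choice: because only the suffix differs between stages, escalation costs one extra generation over an already-cached prefix rather than a second full-context pass.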
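The circuit-breaker is simpler still. A sketch under the post's stated thresholds (three consecutive denials or twenty total); the class name and return convention are invented:

```python
# Illustrative circuit-breaker over classifier denials. Thresholds are from
# the post (3 consecutive / 20 total); everything else is invented.

class DenialBreaker:
    def __init__(self, max_consecutive: int = 3, max_total: int = 20):
        self.max_consecutive = max_consecutive
        self.max_total = max_total
        self.consecutive = 0
        self.total = 0

    def record(self, allowed: bool) -> bool:
        """Record one gate verdict; return True if the session should
        stop and escalate to a human."""
        if allowed:
            self.consecutive = 0  # an approval resets the streak...
            return False
        self.consecutive += 1
        self.total += 1           # ...but the lifetime count never resets
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)
```

The asymmetry is the point: a denial streak means the agent is repeatedly attempting something the gate dislikes right now, while the lifetime count catches slow-burn friction accumulating across a long session.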
The ecosystem read on this is that Anthropic is betting the model is good enough to police itself, with deliberate context-stripping as the safety mechanism. Cursor's permission story is rules-based config; Aider stays approve-each-action; OpenAI's Codex leans more on container isolation than on runtime classifier gating. Anthropic's bet is uniquely architectural: the same model family does both the work and the oversight, with input shape (no prose, no outputs) as the trust mechanism. The 0.4%/17% trade matters: builders accept roughly one spurious block per 250 gated actions in exchange for a small leak rate of overeager moves that slip through. Entering Auto Mode also drops aggressive user-configured allowlist rules (blanket shell access, wildcarded python/node/ruby interpreters, package-manager run commands), meaning custom configs you've built up don't carry over the way they do under approve-each-action. That's the gotcha to know before flipping a long-running session into Auto Mode; an audit sketch follows below.
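If you want to know ahead of time which of your rules would be dropped, an audit pass like the following works. The rule strings mirror Claude Code's `Bash(...)` permission syntax, but the patterns flagged here encode my reading of the post's list (blanket shell, wildcarded interpreters, package-manager runs), not a published stripping spec:

```python
# Illustrative audit of allowlist rules against the categories the post
# says Auto Mode strips. The regexes are my interpretation of that list,
# not an official specification.
import re

STRIPPED_PATTERNS = [
    (re.compile(r"^Bash\(\*?\)$"), "blanket shell access"),
    (re.compile(r"^Bash\((python3?|node|ruby)\b.*\*"), "wildcarded interpreter"),
    (re.compile(r"^Bash\((npm|pnpm|yarn|pip|cargo)\s+(run|exec|install)\b"),
     "package-manager run command"),
]

def audit(allow_rules: list[str]) -> list[tuple[str, str]]:
    """Return (rule, reason) pairs for rules likely stripped on Auto Mode entry."""
    flagged = []
    for rule in allow_rules:
        for pattern, reason in STRIPPED_PATTERNS:
            if pattern.search(rule):
                flagged.append((rule, reason))
                break
    return flagged

if __name__ == "__main__":
    # Hypothetical rules of the "tuned for speed" kind the post warns about.
    rules = ["Bash(*)", "Bash(python *)", "Bash(npm run *)", "Bash(git status)"]
    for rule, reason in audit(rules):
        print(f"would be stripped: {rule}  ({reason})")
```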
Practical moves: if you're running Claude Code with custom allowlist rules tuned for speed, audit them before enabling Auto Mode, because your blanket shell rules get stripped. If you use subagents heavily, expect outbound gating to add latency at delegation; the return-phase flag is informational, not blocking. With 0.4% FPR across the full pipeline, plan for occasional spurious approval prompts in long sessions, but don't architect around them. The 3-consecutive / 20-total circuit-breaker is the floor: if you hit it, that's the system telling you to actually look. The reasoning-blind design is the signal for builders thinking about their own agent stacks: when the classifier and the actor share weights, what each of them sees has to be different, or the actor can talk the classifier out of its judgment.
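For your own stack, the transferable piece is the input shape. A minimal sketch of building the gatekeeper's view from a full transcript, assuming a simple role/type message schema (the field names are mine, not any specific API's):

```python
# Illustrative "reasoning-blind" view builder. Field names (role, type)
# are assumed message-schema conventions, not a specific vendor API.

def classifier_view(transcript: list[dict]) -> list[dict]:
    """Keep only what the gate is allowed to see: user messages and the
    agent's tool calls. Assistant prose and tool results are dropped, so
    nothing the actor says (and nothing a tool returns) can lobby the gate.
    """
    view = []
    for msg in transcript:
        if msg["role"] == "user":
            view.append(msg)
        elif msg["role"] == "assistant" and msg.get("type") == "tool_call":
            # The proposed action itself is what gets judged...
            view.append(msg)
        # ...but assistant prose and tool-output messages never reach the
        # classifier: that omission is the meta-prompt-injection firewall.
    return view
```

The design choice to notice is that the isolation lives in data flow, not in prompt instructions: a gate that never receives the persuasive text cannot be persuaded by it.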
