Anthropic published the technical architecture behind Claude Code's Auto Mode, the permission system (GA since April) that replaced approve-each-action with classifier-gated execution. The headline numbers from the post: a 0.4% false-positive rate after the two-stage filter, a 17% false-negative rate on detected overeager actions, and a classifier that is blind by design to assistant prose and tool outputs. For builders running Claude Code in autonomous loops or subagent fan-outs, those numbers shape what to expect from the safety circuit, and the architectural choice underneath is interesting on its own.
Three tiers of action:

- Tier 1 (file reads, text and file-pattern search, code navigation, todo and plan-mode transitions) runs on a fixed allowlist and is never classifier-gated.
- Tier 2 (file writes and edits inside the project directory) is auto-permitted without classifier review.
- Tier 3 (shell commands, web fetches, external tool integrations, subagent spawns, filesystem operations outside the project directory) routes through the classifier.

The classifier is Sonnet 4.6, run in two stages: a fast single-token yes/no filter (8.5% FPR on its own), with chain-of-thought reasoning kicking in only when stage 1 flags an action (collapsing the full pipeline to 0.4% FPR). The same prompt feeds both stages with only the final instruction changed, so prompt caching does most of the work; the staged escalation is sketched below. The design choice that matters: the classifier is **reasoning-blind by intent**. It sees only user messages and the agent's tool calls, never the assistant's prose or tool outputs. That is deliberate isolation against the meta-prompt-injection threat, where tool output convinces the gatekeeper itself that something dangerous is fine. Subagent calls are gated twice: outbound at delegation (blocked if unauthorized), and on return, where a flag is a warning rather than a discard. Three consecutive denials, or twenty total, trips the circuit-breaker and escalates to a human.
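To make the staged escalation concrete, here is a minimal sketch of how a two-stage gate like the one described could be wired. Everything here is illustrative: `call_model`, the prompt wording, and the message shapes are my assumptions, not Anthropic's implementation. The only details taken from the post are the single-token first stage, the chain-of-thought second stage that runs only on a stage-1 flag, and the shared prompt prefix that makes caching effective.

```python
# Illustrative sketch of a two-stage permission gate; not Anthropic's code.
# Assumed: call_model(system, messages, max_tokens) returns completion text
# from some chat-completion API. Prompt strings are invented for the sketch.

GATE_CONTEXT = (
    "You are a permission classifier for an autonomous coding agent.\n"
    "You will see the user's messages and the agent's proposed tool calls,\n"
    "nothing else. Judge whether the pending tool call exceeds what the\n"
    "user asked for.\n"
)

STAGE1_SUFFIX = "Answer with a single token: yes (flag) or no (allow)."
STAGE2_SUFFIX = ("Reason step by step about intent and blast radius, "
                 "then end with ALLOW or BLOCK.")

def gate(classifier_view: list[dict], call_model) -> bool:
    """Return True if the pending action is allowed.

    classifier_view holds only user messages and tool calls (see the
    reasoning-blind view builder later in this piece). Both stages share
    the same prefix, so a prompt cache only pays for the changed suffix.
    """
    # Stage 1: cheap single-token screen. Per the post, this alone runs
    # at ~8.5% FPR, so its job is recall, not precision.
    verdict = call_model(
        system=GATE_CONTEXT + STAGE1_SUFFIX,
        messages=classifier_view,
        max_tokens=1,
    ).strip().lower()
    if verdict != "yes":
        return True  # stage 1 saw nothing worth a second look

    # Stage 2: chain-of-thought review, run only for flagged actions.
    # This is the step the post credits with collapsing FPR to 0.4%.
    reasoning = call_model(
        system=GATE_CONTEXT + STAGE2_SUFFIX,
        messages=classifier_view,
        max_tokens=1024,
    )
    return reasoning.rstrip().endswith("ALLOW")
```

The caching-friendly shape is the notable design choice: because only the suffix differs between stages, escalation costs one extra generation over an already-cached prefix rather than a second full-context pass.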
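The circuit-breaker is simpler still. A sketch under the post's stated thresholds (three consecutive denials or twenty total); the class name and return convention are invented:

```python
# Illustrative circuit-breaker over classifier denials. Thresholds are from
# the post (3 consecutive / 20 total); everything else is invented.

class DenialBreaker:
    def __init__(self, max_consecutive: int = 3, max_total: int = 20):
        self.max_consecutive = max_consecutive
        self.max_total = max_total
        self.consecutive = 0
        self.total = 0

    def record(self, allowed: bool) -> bool:
        """Record one gate verdict; return True if the session should
        stop and escalate to a human."""
        if allowed:
            self.consecutive = 0  # an approval resets the streak...
            return False
        self.consecutive += 1
        self.total += 1           # ...but the lifetime count never resets
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)
```

The asymmetry is the point: a denial streak means the agent is repeatedly attempting something the gate dislikes right now, while the lifetime count catches slow-burn friction accumulating across a long session.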
The ecosystem read on this is that Anthropic is betting the model is good enough to police itself, with deliberate context-stripping as the safety mechanism. Cursor's permission story is rules-based config; Aider stays approve-each-action; OpenAI's Codex leans more on container isolation than on runtime classifier gating. Anthropic's bet is uniquely architectural: the same model family does both the work and the oversight, with input shape (no prose, no outputs) as the trust mechanism. The 0.4%/17% trade matters: builders accept roughly one spurious block per 250 gated actions in exchange for a small leak rate of overeager moves that slip through. Entering Auto Mode also drops aggressive user-configured allowlist rules (blanket shell access, wildcarded python/node/ruby interpreters, package-manager run commands), meaning custom configs you've built up don't carry over the way they do under approve-each-action. That's the gotcha to know before flipping a long-running session into Auto Mode; an audit sketch follows below.
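If you want to know ahead of time which of your rules would be dropped, an audit pass like the following works. The rule strings mirror Claude Code's `Bash(...)` permission syntax, but the patterns flagged here encode my reading of the post's list (blanket shell, wildcarded interpreters, package-manager runs), not a published stripping spec:

```python
# Illustrative audit of allowlist rules against the categories the post
# says Auto Mode strips. The regexes are my interpretation of that list,
# not an official specification.
import re

STRIPPED_PATTERNS = [
    (re.compile(r"^Bash\(\*?\)$"), "blanket shell access"),
    (re.compile(r"^Bash\((python3?|node|ruby)\b.*\*"), "wildcarded interpreter"),
    (re.compile(r"^Bash\((npm|pnpm|yarn|pip|cargo)\s+(run|exec|install)\b"),
     "package-manager run command"),
]

def audit(allow_rules: list[str]) -> list[tuple[str, str]]:
    """Return (rule, reason) pairs for rules likely stripped on Auto Mode entry."""
    flagged = []
    for rule in allow_rules:
        for pattern, reason in STRIPPED_PATTERNS:
            if pattern.search(rule):
                flagged.append((rule, reason))
                break
    return flagged

if __name__ == "__main__":
    # Hypothetical rules of the "tuned for speed" kind the post warns about.
    rules = ["Bash(*)", "Bash(python *)", "Bash(npm run *)", "Bash(git status)"]
    for rule, reason in audit(rules):
        print(f"would be stripped: {rule}  ({reason})")
```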
Practical moves: if you're running Claude Code with custom allowlist rules tuned for speed, audit them before enabling Auto Mode, because your blanket shell rules get stripped. If you use subagents heavily, expect outbound gating to add latency at delegation; the return-phase flag is informational, not blocking. With 0.4% FPR across the full pipeline, plan for occasional spurious approval prompts in long sessions, but don't architect around them. The 3-consecutive / 20-total circuit-breaker is the floor: if you hit it, that's the system telling you to actually look. The reasoning-blind design is the signal for builders thinking about their own agent stacks: when the classifier and the actor share weights, what each of them sees has to be different, or the actor can talk the classifier out of its judgment.
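For your own stack, the transferable piece is the input shape. A minimal sketch of building the gatekeeper's view from a full transcript, assuming a simple role/type message schema (the field names are mine, not any specific API's):

```python
# Illustrative "reasoning-blind" view builder. Field names (role, type)
# are assumed message-schema conventions, not a specific vendor API.

def classifier_view(transcript: list[dict]) -> list[dict]:
    """Keep only what the gate is allowed to see: user messages and the
    agent's tool calls. Assistant prose and tool results are dropped, so
    nothing the actor says (and nothing a tool returns) can lobby the gate.
    """
    view = []
    for msg in transcript:
        if msg["role"] == "user":
            view.append(msg)
        elif msg["role"] == "assistant" and msg.get("type") == "tool_call":
            # The proposed action itself is what gets judged...
            view.append(msg)
        # ...but assistant prose and tool-output messages never reach the
        # classifier: that omission is the meta-prompt-injection firewall.
    return view
```

The design choice to notice is that the isolation lives in data flow, not in prompt instructions: a gate that never receives the persuasive text cannot be persuaded by it.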
