Claude Code Auto Mode: Sonnet 4.6 two-stage classifier, 0.4% false positives

Anthropic has published the technical architecture behind Claude Code's Auto Mode, the permission system (generally available since April) that replaces per-action approval with classifier-gated execution. The post's key numbers: a 0.4% false-positive rate after the two-stage filter, a 17% false-negative rate on detected overeager actions, and a classifier that is blind by design to assistant prose and tool outputs. For builders running Claude Code in autonomous loops or subagent fan-outs, these numbers shape what to expect from the safety circuit, and the underlying architectural choice is interesting in its own right.

Actions fall into three tiers. Tier 1 (file reads, text and file-pattern search, code navigation, todo and plan-mode transitions) runs on a fixed allowlist and is never classifier-gated. Tier 2 (file writes and edits inside the project directory) is auto-permitted without classifier review. Tier 3 (shell commands, web fetches, external tool integrations, subagent spawns, filesystem operations outside the project) goes through the classifier. The classifier is Sonnet 4.6, run in two stages: a fast single-token "yes/no" filter (8.5% FPR on its own), with chain-of-thought reasoning kicking in only when stage 1 flags an action (the full pipeline drops to 0.4% FPR). The same prompt feeds both stages and only the final instruction changes, so prompt caching covers most of the work. The important design choice: the classifier is **reasoning-blind by design**. It sees only user messages and the agent's tool calls, never assistant prose or tool outputs. This is deliberate isolation against a meta-prompt-injection threat, in which a tool output convinces the gatekeeper that something dangerous is fine. Subagent calls are gated twice: outbound at delegation (blocked if unauthorized), and again in the return phase, where a flagged result is surfaced with a warning rather than discarded. Three consecutive denials, or 20 in total, trip a circuit breaker and escalate to a human.
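The routing described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Anthropic's implementation: the action names, the transcript event shape, and the `stage1_flags` / `stage2_blocks` callables are all hypothetical stand-ins for the real classifier calls.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical action kinds; illustrative names, not Claude Code identifiers.
TIER1_ALLOWLIST = {
    "file_read", "text_search", "file_pattern_search",
    "code_navigation", "todo_update", "plan_mode_transition",
}
TIER2_IN_PROJECT = {"file_write", "file_edit"}


@dataclass
class Action:
    kind: str
    in_project: bool = True


def tier_of(action: Action) -> int:
    """Map an action to its tier (1 = fixed allowlist, 2 = auto-permitted, 3 = gated)."""
    if action.kind in TIER1_ALLOWLIST:
        # Assumption: this sketch treats allowlisted kinds as tier 1 regardless of path.
        return 1
    if action.kind in TIER2_IN_PROJECT and action.in_project:
        return 2
    return 3  # shell, web fetch, external tools, subagents, out-of-project file ops


def classifier_view(transcript: list[dict]) -> list[dict]:
    """Keep only user messages and tool calls; drop assistant prose and tool outputs."""
    return [e for e in transcript if e["type"] in ("user_message", "tool_call")]


Stage = Callable[[list[dict], Action], bool]


def gate(action: Action, transcript: list[dict],
         stage1_flags: Stage, stage2_blocks: Stage) -> bool:
    """Return True if the action may run without asking a human."""
    if tier_of(action) < 3:
        return True
    view = classifier_view(transcript)
    if not stage1_flags(view, action):   # fast single-token filter
        return True
    return not stage2_blocks(view, action)  # reasoning runs only when stage 1 flags
```

The point of `classifier_view` is that the gatekeeper never reads text the actor or a tool produced, so neither can argue an action past it.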

The ecosystem reading is that Anthropic is betting the model is good enough to police itself, with deliberate context-stripping as the safety mechanism. Cursor's permission story is rules-based configuration; Aider stays on approve-each; OpenAI's Codex leans more on container isolation and less on runtime classifier gating. Anthropic's bet is uniquely architectural: the same model family does both the work and the oversight, with the shape of the input (no prose, no outputs) acting as the trust mechanism. The 0.4%/17% tradeoff matters: builders accept roughly 1 spurious block per 250 gated actions in exchange for a 17% miss rate on overeager actions. Entering Auto Mode also drops aggressive user-configured allowlist rules (blanket shell access, wildcarded python/node/ruby interpreters, package-manager run commands), so the custom configs you built do not carry over the way they would under approve-each. That is the gotcha to know before flipping a long-running session into auto.
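To make the tradeoff concrete, here is a small back-of-the-envelope calculation. It assumes each gated action is an independent trial at the quoted rates, which is a simplification; the function names are illustrative.

```python
FPR = 0.004  # full-pipeline false-positive rate quoted in the post
FNR = 0.17   # false-negative rate on overeager actions quoted in the post


def expected_spurious_blocks(gated_actions: int) -> float:
    """Expected number of benign Tier 3 actions that get blocked."""
    return gated_actions * FPR


def p_at_least_one_spurious_block(gated_actions: int) -> float:
    """Chance of seeing one or more spurious blocks, assuming independence."""
    return 1 - (1 - FPR) ** gated_actions


def expected_missed(overeager_actions: int) -> float:
    """Expected number of overeager actions that pass the classifier."""
    return overeager_actions * FNR


for n in (50, 250, 1000):
    print(n, round(expected_spurious_blocks(n), 2),
          round(p_at_least_one_spurious_block(n), 3))
```

At 250 gated actions the expected count is one spurious block, but the chance of seeing at least one is only about 63%, so short sessions will often see none.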

The practical move: if you run Claude Code with custom allowlist rules tuned for speed, audit them before enabling Auto Mode, because your blanket shell rules get stripped. If you use subagents heavily, expect outbound gating to add latency at delegation; the return-phase flag is informational, not blocking. At a 0.4% full-pipeline FPR, plan for occasional spurious approval prompts in long sessions, but do not build architecture around them. The 3-consecutive / 20-total circuit breaker is the floor: if you hit it, the system is telling you to actually look. The reasoning-blind design is a signal for builders thinking about their own agent stacks: when classifier and actor share weights, what each one sees has to differ, otherwise the actor can talk the classifier out of its judgment.
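For builders who want a similar floor in their own agent stacks, the 3-consecutive / 20-total rule can be sketched as a small counter. This is a hypothetical sketch: the class name is invented, and it assumes an allowed action resets the consecutive count, which the post does not spell out.

```python
from dataclasses import dataclass


@dataclass
class DenialBreaker:
    """Trips after 3 consecutive denials or 20 denials in total."""
    max_consecutive: int = 3
    max_total: int = 20
    consecutive: int = 0
    total: int = 0

    def record(self, denied: bool) -> bool:
        """Record one classifier decision; return True if a human should step in."""
        if denied:
            self.consecutive += 1
            self.total += 1
        else:
            # Assumption: any allowed action resets the consecutive streak.
            self.consecutive = 0
        return (self.consecutive >= self.max_consecutive
                or self.total >= self.max_total)


breaker = DenialBreaker()
for denied in (True, True, True):
    if breaker.record(denied):
        print("escalate to human")
```

Keeping both limits matters: the consecutive limit catches an agent stuck retrying one blocked move, while the total limit catches slow accumulation across a long session.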
