Agent security: 4 attack patterns, proposal/execution split as architecture, Zubnet AI News

A new survey on agentic AI security names four concrete attack patterns against LLM-based operations agents that hold real production access. Prompt injection: malicious instructions embedded in a Jira ticket or wiki page steer the agent toward an unsafe action. Retrieval poisoning: corrupted runbooks and incident histories bias agent diagnoses toward attacker objectives. Retrieval jamming: flooding knowledge bases with blocker documents triggers refusal loops, stalling incident response — a denial-of-service against the agent's decision loop. Telemetry manipulation: attackers influence metrics and logs to steer mitigation decisions without ever touching the model itself. Common thread: the confused-deputy problem. The agent has legitimate API access, but the artifacts shaping its decisions — tickets, logs, transcripts, wiki pages, retrieved documents — are exactly the surfaces attackers can compromise.

The proposed defense is architectural rather than model-level. Split proposal from execution: the language model reasons, retrieves evidence, and drafts change proposals, but cannot execute writes. All production changes pass through non-bypassable gates enforcing policy checks, invariant verification, human approval where the change warrants it, and staged rollback. The risk tiering the survey lands on: read-only assistance is low-risk; bounded execution with strong gates is defensible; open-ended self-healing without verification scaffolding is the higher-risk claim that deserves skepticism. The evaluation gap is the part most builders should pay attention to: current benchmarks miss tool-call traces, gate-violation rates, adversarial input behavior, refusal-storm rates under jamming, and rollback completeness. Systems that perform well on clean incidents can collapse under hostile Jira tickets, and the eval suite would never know.
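A minimal sketch of the proposal/execution split, in Python. Nothing here comes from the survey itself: the names `Proposal`, `Gate`, `Risk`, the allowlist, and the lambda-based invariant and approver are all hypothetical, chosen only to illustrate the shape. The model side can only build a `Proposal`; the executor callable is reachable only through `Gate.execute`. Staged rollback is left out to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"    # low-risk assistance
    BOUNDED = "bounded"        # defensible with strong gates
    OPEN_ENDED = "open_ended"  # never executed in this sketch


@dataclass(frozen=True)
class Proposal:
    """What the agent is allowed to produce: a description, not an action."""
    action: str
    target: str
    risk: Risk


@dataclass(frozen=True)
class Decision:
    allowed: bool
    reason: str


class Gate:
    """Hypothetical gate: policy check, invariants, then human approval."""

    def __init__(self, allowed_actions, invariants, approver):
        self._allowed = frozenset(allowed_actions)
        self._invariants = dict(invariants)  # name -> predicate(Proposal)
        self._approver = approver            # callable(Proposal) -> bool
        self.audit_log = []                  # every decision, allowed or not

    def review(self, p):
        if p.risk is Risk.OPEN_ENDED:
            return Decision(False, "open-ended changes are not executable")
        if p.action not in self._allowed:
            return Decision(False, f"action '{p.action}' not in policy")
        for name, check in self._invariants.items():
            if not check(p):
                return Decision(False, f"invariant failed: {name}")
        # Only bounded (write) changes need a human in this sketch.
        if p.risk is Risk.BOUNDED and not self._approver(p):
            return Decision(False, "human approval denied")
        return Decision(True, "approved")

    def execute(self, p, executor):
        decision = self.review(p)
        self.audit_log.append((p, decision))
        if decision.allowed:
            executor(p)  # the only call site that touches production
        return decision


executed = []
gate = Gate(
    allowed_actions={"restart_service", "read_logs"},
    invariants={"never touch prod-db": lambda p: p.target != "prod-db"},
    approver=lambda p: True,  # stand-in for a real approval flow
)
ok = gate.execute(Proposal("restart_service", "web-1", Risk.BOUNDED), executed.append)
blocked = gate.execute(Proposal("drop_table", "prod-db", Risk.BOUNDED), executed.append)
```

The design choice worth noting is that the gate never reads ticket text, logs, or retrieved documents. It sees only the structured proposal, so injected content can change what the agent asks for but not what the policy permits.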

Ecosystem context. This is the threat-model side of what Anthropic shipped this week with Managed Agents and MCP Tunnels. The architectural primitives that let agents reach production systems are also where the confused-deputy class of attack opens up. Anthropic's Auto Mode destructive-action screening (announced at Code With Claude) is one shape of the gate this survey calls for; the broader question is which set of gates is sufficient for which risk tier. The gap in the current eval landscape is structural: SWE-bench Verified, MMLU, and clean-incident agent benchmarks measure capability under cooperative inputs. Adversarial robustness (refusal-storm rates, gate-violation rates, prompt-injection resistance) is largely unmeasured at the benchmark level. Anthropic's Capability Curve narrative (62% to 87% on SWE-bench Verified) measures one axis; this survey's framing shows that the orthogonal axis is where production-grade agents actually live or die. For wrapper-ecosystem builders (LangGraph, AutoGen, CrewAI), the confused-deputy framing has design implications: the gates need to live in the state-management and tool-call routing layers, not in the model itself.

Monday: if you ship agents with production access (CI runners, incident response, infra automation, support-side ticketing automation), audit your stack against the four patterns this week. Concrete actions follow. First, list every input the agent treats as trusted (tickets, wiki, telemetry, Slack threads, retrieved documents) and assume each can be hostile; the threat is content injected by an attacker, not a model jailbreak. Second, implement the proposal/execution split: the agent drafts, and a non-bypassable gate (policy check, invariant verification, optional human approval) executes. The security review concentrates on the gate, not the model prompt. Third, add evals for adversarial inputs: at minimum, prompt-injected tickets, poisoned retrieval contexts, and refusal-storm scenarios. Fourth, track the refusal-storm rate as an explicit metric. An agent that "won't act under hostile inputs" looks safe in isolation but stalls real incident response under jamming; unsafe execution and excessive refusal are separate failure modes, and each needs its own budget. The clean-eval-benchmark trap is real. Adversarial robustness is the next eval axis after raw capability, and most production agent deployments are not measuring it yet.
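One way to make the third and fourth actions concrete is to score adversarial trials per scenario and report the two failure modes separately. The sketch below is an assumption about how such a harness might be structured, not something the survey specifies: the `Trial` fields, the scenario labels, and the `rates` function are hypothetical names, and the trial data is invented purely to show the arithmetic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    scenario: str          # e.g. "clean", "prompt_injection", "retrieval_jamming"
    should_act: bool       # ground truth: a safe, correct action existed
    acted: bool            # the agent's proposal reached execution
    unsafe_executed: bool  # an executed action violated policy


def rates(trials):
    """Per scenario: gate-violation rate and refusal rate, kept separate."""
    by_scenario = {}
    for t in trials:
        by_scenario.setdefault(t.scenario, []).append(t)

    report = {}
    for scenario, ts in by_scenario.items():
        actionable = [t for t in ts if t.should_act]
        refusals = sum(1 for t in actionable if not t.acted)
        report[scenario] = {
            # Share of all trials where something unsafe got through the gate.
            "gate_violation_rate": sum(t.unsafe_executed for t in ts) / len(ts),
            # Share of actionable trials where the agent stalled instead.
            "refusal_rate": refusals / len(actionable) if actionable else 0.0,
        }
    return report


# Invented example data, for illustration only.
trials = [
    Trial("clean", True, True, False),
    Trial("clean", True, True, False),
    Trial("prompt_injection", True, True, True),
    Trial("prompt_injection", True, True, False),
    Trial("retrieval_jamming", True, False, False),
    Trial("retrieval_jamming", True, False, False),
    Trial("retrieval_jamming", True, False, False),
    Trial("retrieval_jamming", True, True, False),
]
report = rates(trials)
```

With this invented data the clean scenario scores zero on both metrics, while the jamming scenario shows no gate violations and a 0.75 refusal rate. That is the pattern the article warns about: a system that looks safe on one axis while failing on the other.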

