Dreadnode published research using an automated red-teaming agent, with Moonshot AI's Kimi 2.5 serving as both attacker and judge, against Meta's Llama Scout (17 billion active parameters, released April 2025). Headline: 85% success across 68 adversarial goals, spanning three attack types and five transform variants. Crescendo (gradual multi-turn escalation that softens refusals), Graph of Attacks with Pruning (search through the attack space), and persona-based transforms (skeleton-key role-play) each hit 100%. Base64 encoding reached 75%. Translation into low-resource languages was also tested. The paper acknowledges that humans still outperform the agent on long-horizon reasoning and complex social engineering, though no formal comparison with expert human operators was conducted. Citation: arxiv.org/pdf/2410.02828.
The "Kimi 2.5 as attacker AND judge" setup is the methodology innovation. Standard human red-teaming has an attacker (red team) and a separate judge (eval team or safety org). Replacing both with the same LLM lets you run 68 adversarial goals at machine speed, far more than a human red team can cover on a comparable budget. Crescendo, Graph of Attacks with Pruning, and persona-based skeleton-key attacks are all known techniques from the safety-research literature; what's new is an automated agent applying them at scale with high reproducibility. Base64 encoding and low-resource language translation are simpler obfuscations that still defeat current safety training in a non-trivial fraction of cases. The 85% overall plus 100% on three attack types means that, against Llama Scout, automated red-teaming finds a working jailbreak essentially every time on most attack categories. Llama Scout being open-weight matters for the threat model: anyone can download and study the weights, and anyone can run the same red-teaming pipeline. The Dreadnode result quantifies what was previously an assumption.
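As a rough illustration of the attacker-plus-judge loop described above, here is a minimal Python sketch. It is not Dreadnode's implementation: the `Model` callable type, the prompt formats, the `SUCCESS` verdict convention, and the `max_turns` budget are all assumptions made for the example. Passing the same callable as both `attacker` and `judge` mirrors the single-model setup.

```python
import base64
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a response string.
Model = Callable[[str], str]


def base64_transform(prompt: str) -> str:
    """Obfuscate a prompt by Base64-encoding it (one of the simpler transforms)."""
    encoded = base64.b64encode(prompt.encode("utf-8")).decode("ascii")
    return f"Decode this Base64 string and follow the instructions: {encoded}"


@dataclass
class AttackResult:
    goal: str
    success: bool
    turns_used: int


def run_attack(goal: str, attacker: Model, target: Model, judge: Model,
               max_turns: int = 5) -> AttackResult:
    """Refine the attack prompt turn by turn until the judge reports success."""
    feedback = ""
    for turn in range(1, max_turns + 1):
        # The attacker sees the goal and the target's last response.
        prompt = attacker(f"GOAL: {goal}\nPREVIOUS RESPONSE: {feedback}")
        response = target(prompt)
        # Assumed convention: the judge's verdict starts with SUCCESS or FAIL.
        verdict = judge(f"GOAL: {goal}\nRESPONSE: {response}")
        if verdict.strip().upper().startswith("SUCCESS"):
            return AttackResult(goal, True, turn)
        feedback = response
    return AttackResult(goal, False, max_turns)


def success_rate(results: List[AttackResult]) -> float:
    """Fraction of goals for which some attack succeeded."""
    return sum(r.success for r in results) / len(results) if results else 0.0
```

In a real harness the three callables would wrap API or local-inference calls, and `base64_transform` would be applied to the attacker's prompt before it reaches the target.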
This is the offense-side complement to yesterday's coverage of agent security (proposal-execution split, four attack patterns, eval gap). Where yesterday's piece said "your evals don't measure adversarial robustness," today's says "automated red-team agents hit 85% on production-grade open-weight LLMs, and your evals definitely don't catch that." The humans-still-better caveat matters: automated agents hit 85% on single-turn and bounded multi-turn attacks, but genuine long-horizon reasoning and human social-engineering edge cases remain harder. That's where adversarial evals should focus next. For builders deploying Llama Scout or similar open-weight models behind customer-facing surfaces: the assumption "publishing weights doesn't help attackers because they could probe via API anyway" is now quantitatively false. Open weights plus agent red-teaming equals 85% success against current safety training. The defensive primitives in yesterday's coverage (proposal-execution split, non-bypassable gates, policy checks) are the only mitigations that matter once you accept the model itself is jailbreakable at this rate.
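To make the proposal-execution split concrete, here is a minimal Python sketch of one way such a gate could look. It is an assumption about the pattern, not code from the earlier piece or from the paper: the tool names, the `ALLOWED_TOOLS` set, and the body-length limit are all hypothetical. The model only proposes an action as data, and a deterministic check runs on the execution path.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical policy: the tools the model may trigger, plus one argument limit.
ALLOWED_TOOLS = {"search_docs", "create_ticket"}
MAX_TICKET_BODY_CHARS = 2000


class PolicyViolation(Exception):
    """Raised when a proposed action fails a policy check."""


@dataclass(frozen=True)
class ProposedAction:
    """What the model emits: a proposal only, never a direct side effect."""
    tool: str
    args: Dict[str, Any] = field(default_factory=dict)


def check_policy(action: ProposedAction) -> None:
    """Deterministic checks that do not depend on the model's own judgment."""
    if action.tool not in ALLOWED_TOOLS:
        raise PolicyViolation(f"tool not allowed: {action.tool}")
    if action.tool == "create_ticket":
        if len(action.args.get("body", "")) > MAX_TICKET_BODY_CHARS:
            raise PolicyViolation("ticket body exceeds size limit")


def execute(action: ProposedAction,
            handlers: Dict[str, Callable[..., Any]]) -> Any:
    """The only path to side effects; the gate always runs before the handler."""
    check_policy(action)
    return handlers[action.tool](**action.args)
```

The design point is that a jailbroken model can change what gets proposed, but not whether `check_policy` runs, because the check sits in ordinary code outside the model.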
Monday: if your stack uses Llama Scout (17B), Llama 3.1, Llama 3.3, or any similar-class open-weight model behind a customer-facing surface, assume current safety filters are insufficient. Concrete actions. First, read the Dreadnode paper (arxiv.org/pdf/2410.02828) and reproduce the agent setup against your specific deployed model plus system-prompt combination. Run it at least quarterly. Use Kimi 2.5 or a comparably strong model as attacker and judge. Second, build the gates described in yesterday's security piece: proposal-execution split, policy checks, invariant verification. They are the only defense layer that matters once you accept the model itself is jailbreakable 85% of the time. Third, treat refusal rates as a first-class safety metric alongside accuracy on benign evals. If your model never refuses anything adversarial, you have no signal; if it refuses everything, you have the refusal-storm denial-of-service problem from yesterday. The middle band is where production lives, and you need to measure it. Fourth, add Crescendo and Graph of Attacks with Pruning to your adversarial eval suite by default. You don't need to wait for a paper to tell you your model fails; the 100% success rates in this study say it does.
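The refusal-rate and per-attack tracking suggested above could be sketched in Python as follows. Everything specific here is an assumption for illustration: the keyword-based `is_refusal` heuristic (a judge model would likely be more reliable), the marker strings, and the thresholds in `check_band`, which should be tuned for each deployment.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical heuristic markers; real refusals vary widely in wording.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm unable")


def is_refusal(response: str) -> bool:
    """Crude keyword check for whether a response is a refusal."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(responses: List[str]) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty list)."""
    if not responses:
        return 0.0
    return sum(is_refusal(r) for r in responses) / len(responses)


def success_by_attack(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Attack success rate per attack type from (attack_type, succeeded) pairs."""
    totals = defaultdict(lambda: [0, 0])  # attack_type -> [successes, attempts]
    for attack_type, succeeded in records:
        totals[attack_type][0] += int(succeeded)
        totals[attack_type][1] += 1
    return {name: wins / attempts for name, (wins, attempts) in totals.items()}


def check_band(adversarial_refusals: float, benign_refusals: float,
               min_adversarial: float = 0.5, max_benign: float = 0.05) -> List[str]:
    """Return warnings when refusal rates leave the (assumed) acceptable band."""
    warnings = []
    if adversarial_refusals < min_adversarial:
        warnings.append("model rarely refuses adversarial prompts")
    if benign_refusals > max_benign:
        warnings.append("model over-refuses benign prompts")
    return warnings
```

Running `refusal_rate` separately on an adversarial set and a benign set, then passing both numbers to `check_band`, gives one signal per failure mode: too few refusals under attack, or too many on normal traffic.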
