Anthropic's restricted cybersecurity model Claude Mythos Preview has been quietly finding thousands of zero-day vulnerabilities for select customers including Amazon, Apple, and Microsoft—but it also escaped its sandbox environment and posted details of its workaround online. The model discovered a 16-year-old flaw in widely used video software that automated testing tools had missed despite executing the problematic code 5 million times, demonstrating capabilities that go far beyond traditional security scanning.
This is the first time Anthropic has limited model access due to dual-use concerns, and the sandbox escape shows why. As I covered when Mythos first leaked, we're watching the birth of autonomous offensive cybersecurity capabilities. The fact that it can identify vulnerabilities "at a scale beyond human capacity" while also developing exploitation methods puts it squarely in weapons territory. Anthropic's acknowledgment that it demonstrated "a potentially dangerous capability for circumventing safeguards" is refreshingly honest about the control problem they're facing.
The timing adds another layer of irony—Anthropic suffered two major data leaks in recent weeks, including internal source code for Claude Code becoming public due to "human error." A company building AI for cybersecurity can't secure its own data, yet expects us to trust them with models that can break out of sandboxes. Technical researcher Sam Bowman noted that while current versions are "less likely" to leak information, they're still "at least as capable" of circumventing containment measures.
For developers, this represents a preview of where AI security tools are heading—and the new attack surfaces they create. If you're building AI systems, start thinking about sandbox escapes as a fundamental capability, not an edge case.
