Microsoft MDASH ships 100+ agent pipeline for code audit, 88.45% CyberGym

Microsoft introduced MDASH this week: a multi-model agentic platform that organizes more than 100 specialized AI agents into a multi-stage pipeline to audit code at scale. Note the framing: not "finding AI vulnerabilities" (the InfoQ headline is misleading) but "finding traditional code vulnerabilities using AI agents." The stages are scanning, debate, validation, deduplication, and exploitation, each handled by a different specialized agent role rather than by a single agent working through all of them in sequence. The platform is model-agnostic by design. For builders working on autonomous SecOps or agent-driven code review, the architectural taxonomy is the takeaway, regardless of access to the platform itself.

The numbers are concrete. On CyberGym, a public benchmark of 1,507 real-world vulnerabilities, MDASH scores **88.45%**. On Microsoft's internal historical case sets, it reaches **96% recall** on Windows Common Log File System driver (`clfs.sys`) vulnerabilities and **100% recall** on TCP/IP stack (`tcpip.sys`) cases. The codebases tested are Windows, Hyper-V and Azure: large, mature, heavily audited C and C++ systems where finding novel CVEs is hard and recall is the meaningful metric. The one limitation reported candidly is orchestration risk, specifically the "blast radius of a single misconfigured permission boundary." With 100+ agents talking to each other and to source trees, there is a lot of trust boundary to keep clean, and the platform acknowledges it.
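
For readers less used to the metric, recall on a historical case set is the share of already-known vulnerabilities that the audit rediscovers. A minimal sketch follows. The case IDs and the 25-case set size are hypothetical, chosen only so the arithmetic lands on 96%. They are not Microsoft's data, and the function name `recall` is an illustration rather than anything MDASH exposes.

```python
def recall(found, known):
    """Fraction of known vulnerabilities that the audit rediscovered."""
    known = set(known)
    if not known:
        raise ValueError("recall is undefined for an empty case set")
    return len(known & set(found)) / len(known)


# Hypothetical case set: 25 historical bugs with made-up identifiers.
known_cases = {f"CASE-{i:02d}" for i in range(1, 26)}

# Hypothetical audit output: misses one known bug, reports one unrelated finding.
reported = (known_cases - {"CASE-07"}) | {"NEW-01"}

score = recall(reported, known_cases)
print(f"recall = {score:.0%}")  # 24 of 25 known cases rediscovered
```

The extra `NEW-01` finding does not change the score. Recall says nothing about false positives, which is one reason separate validation and deduplication stages matter.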

Ecosystem read: this is the agent-infrastructure thesis applied to security research, and the architectural pattern is reproducible without access to Microsoft's specific implementation. The "specialized agent per pipeline stage" split (scan, debate, validate, dedupe, exploit) is a useful template for any multi-agent system that hits the single-agent context-explosion ceiling. It pairs naturally with this month's other agent infra: Google Genkit middleware's 3-hook-point composition, Tencent's 4-tier memory pyramid, and Dreadnode's earlier red-team agent work. The shape is the same in each case: break the agent loop into composable specialized stages instead of one monolithic loop trying to do everything. Microsoft's specific contribution is the explicit *debate* and *validation* stages, which most published agent harnesses collapse into a single step.

Monday morning: MDASH itself is "internal Microsoft + limited private preview with selected customers", with no GitHub repo, no license, and no public access for most builders. What's usable today is the architectural template. If you're building an autonomous code-audit agent, your minimum viable pipeline should be: a scanner agent that proposes candidates, a debate agent that argues both sides of each candidate, a validation agent that runs concrete tests, a dedupe agent that merges semantically equivalent findings, and an exploitation agent that produces PoCs for the survivors. The CyberGym 88.45% is the bar to beat if you reproduce this. The honest unknowns: which underlying models Microsoft is using inside MDASH, what the per-stage success rates are individually, and how much of the headline number comes from the pipeline versus model strength.
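
The five-stage template above can be sketched as a chain of plain functions. This is a toy under stated assumptions, not MDASH's implementation: every name here (`Finding`, `scan`, `debate`, `validate`, `dedupe`, `exploit`, `run_pipeline`) is hypothetical, and each stage is a deterministic string heuristic standing in for what would be a model-backed agent in a real system.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Finding:
    path: str
    line: int
    kind: str
    snippet: str
    notes: list = field(default_factory=list)
    poc: Optional[str] = None


def scan(sources):
    """Scanner stand-in: propose every line that calls strcpy as a candidate."""
    found = []
    for path, text in sources.items():
        for number, line in enumerate(text.splitlines(), start=1):
            if "strcpy(" in line:
                found.append(Finding(path, number, "unbounded-copy", line.strip()))
    return found


def debate(findings):
    """Debate stand-in: record a case for and against; drop if 'against' applies."""
    kept = []
    for f in findings:
        f.notes.append("for: strcpy does not bound the copy")
        if "/* bounded */" in f.snippet:
            f.notes.append("against: author marks the length as checked")
            continue
        kept.append(f)
    return kept


def validate(findings):
    """Validation stand-in: a concrete check that no string literal is involved."""
    return [f for f in findings if '"' not in f.snippet]


def dedupe(findings):
    """Dedupe stand-in: treat same kind + same snippet as equivalent, keep the first."""
    unique = {}
    for f in findings:
        unique.setdefault((f.kind, f.snippet), f)
    return list(unique.values())


def exploit(findings):
    """Exploitation stand-in: attach a placeholder PoC to each surviving finding."""
    for f in findings:
        f.poc = f"PoC stub for {f.path}:{f.line}"
    return findings


def run_pipeline(sources, stages=(debate, validate, dedupe, exploit)):
    """Scan once, then pass the candidate list through each later stage in order."""
    findings = scan(sources)
    for stage in stages:
        findings = stage(findings)
    return findings


# Hypothetical input: two tiny C files held in memory.
SOURCES = {
    "a.c": 'strcpy(dst, src);\nstrcpy(buf, "ok");\nstrcpy(dst, src);\n',
    "b.c": "strcpy(out, in); /* bounded */\n",
}
results = run_pipeline(SOURCES)
for f in results:
    print(f.path, f.line, f.kind, f.poc)
```

Because each stage only takes and returns a list of findings, any one of them can be swapped for a model call, or measured on its own, without touching the others. That is also what it would take to answer the per-stage success-rate question.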
