Microsoft disclosed Tuesday that its agentic security harness, codenamed MDASH and built by the Autonomous Code Security team under VP Taesoo Kim, helped researchers discover 16 new vulnerabilities in the Windows networking and authentication stack, including four critical remote-code-execution (RCE) flaws. Two of the RCEs (CVE-2026-40361 and CVE-2026-40364) carry Microsoft's "more likely to be exploited" rating. The architecture matters more than the disclosure count: MDASH coordinates over 100 specialized AI agents alongside an ensemble of frontier and distilled models that "discover, debate, and validate exploitable vulnerabilities end-to-end." Kim's framing in the blog post was sharp: "The durable advantage lies in the agentic system around the model rather than any single model itself." That is an explicit statement that Microsoft is competing on harness design, not on whichever model happens to be inside.
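To make the "discover, debate, and validate" pattern concrete, here is a minimal toy sketch of what such a pipeline could look like. It is not MDASH's design: Microsoft has not published one here, and every name below (`Finding`, `run_pipeline`, the toy agents, the `quorum` threshold) is a hypothetical stand-in, with plain functions playing the role of model-backed agents.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    """A candidate vulnerability report (hypothetical shape)."""
    location: str
    claim: str


# Hypothetical agent roles, modeled as plain callables for illustration.
Finder = Callable[[str], list[Finding]]
Reviewer = Callable[[Finding], bool]
Validator = Callable[[Finding], bool]


def run_pipeline(code: str, finders: list[Finder], reviewers: list[Reviewer],
                 validator: Validator, quorum: float = 0.5) -> list[Finding]:
    """Discover candidates, keep those a majority of reviewers accept, then validate."""
    # Discover: pool candidates from every finder, de-duplicated via the set.
    candidates = {f for finder in finders for f in finder(code)}
    confirmed = []
    for finding in sorted(candidates, key=lambda f: (f.location, f.claim)):
        # Debate: a simple vote stands in for whatever the real agents do.
        votes = sum(1 for review in reviewers if review(finding))
        if not reviewers or votes / len(reviewers) <= quorum:
            continue
        # Validate: a final check, e.g. attempting to reproduce the bug.
        if validator(finding):
            confirmed.append(finding)
    return confirmed


# Toy agents (assumptions for the demo, not real detection logic).
def memcpy_finder(code: str) -> list[Finding]:
    return [Finding(f"line {i}", "unchecked memcpy")
            for i, line in enumerate(code.splitlines(), 1) if "memcpy(" in line]


def noisy_finder(code: str) -> list[Finding]:
    return [Finding("line 1", "possible overflow")]  # always flags line 1


def skeptic(f: Finding) -> bool:
    return f.claim == "unchecked memcpy"


def optimist(f: Finding) -> bool:
    return True


def strict(f: Finding) -> bool:
    return f.location != "line 1"


def toy_validator(f: Finding) -> bool:
    return f.claim == "unchecked memcpy"


sample = "int n = read();\nmemcpy(dst, src, n);\n"
confirmed = run_pipeline(sample, [memcpy_finder, noisy_finder],
                         [skeptic, optimist, strict], toy_validator)
print(confirmed)
```

In this toy run the noisy finder's report is dropped at the vote and only the `memcpy` finding survives all three stages. The filtering between stages, rather than any single agent, is what keeps false positives out of the final list.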

The benchmark numbers split usefully into independently measured and self-reported. The cleanest data point is CyberGym, an external 1,507-vulnerability benchmark drawn from OSS-Fuzz projects: MDASH scored 88.45%, roughly five percentage points ahead of the next-ranked system. That is the one measurement directly comparable with other agentic vulnerability discoverers. The internal numbers are louder but warrant more skepticism: 96% recall on five years of confirmed MSRC vulnerabilities in clfs.sys, 100% recall in tcpip.sys, and 21-of-21 on a private Windows driver called StorageDrive with intentionally injected flaws (kernel UAFs, integer handling, IOCTL gaps, locking errors). Recall on previously confirmed CVEs measures whether a system can rediscover known bugs, not whether it can find genuinely novel ones; and the StorageDrive benchmark, while controlled to keep training data clean, is still Microsoft grading its own work. The 16 newly disclosed Windows vulnerabilities published this Patch Tuesday are the operational evidence that matters most: previously unknown bugs, found by a system that hadn't seen them, in production code.
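The distinction between recall and novel discovery comes down to simple set arithmetic. The sketch below uses made-up identifiers (`INJ-*`, `NEW-*`) and hypothetical helper names; only the 21-of-21 figure comes from the reporting above.

```python
def recall(found: set[str], known: set[str]) -> float:
    """Fraction of already-known bugs that the system rediscovered."""
    if not known:
        raise ValueError("known set must be non-empty")
    return len(found & known) / len(known)


def novel_findings(found: set[str], known: set[str]) -> set[str]:
    """Findings that were not in the known set at all."""
    return found - known


# Hypothetical IDs: 21 injected flaws, all rediscovered (the 21-of-21 case).
known = {f"INJ-{i}" for i in range(1, 22)}
found_a = set(known)

# A second, made-up system with identical recall plus two new findings.
found_b = known | {"NEW-1", "NEW-2"}

print(recall(found_a, known), len(novel_findings(found_a, known)))
print(recall(found_b, known), len(novel_findings(found_b, known)))
```

Both hypothetical systems score perfect recall, yet only the second reports anything outside the known set. A recall figure alone cannot separate them, which is why the newly disclosed CVEs carry more evidential weight than the internal benchmarks.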

The ecosystem read is that agentic vulnerability discovery has crossed from research curiosity to production-grade at three frontier labs almost simultaneously: Google DeepMind's Big Sleep last year, OpenAI's Daybreak last week, and now Microsoft's MDASH. All three pair a model ensemble with a multi-agent debate harness, and all three are now generating real CVEs against real codebases. For the defensive-security stack, this means the bottleneck has moved from "can the AI find bugs" to "can the AI ship them through responsible disclosure faster than attackers can find the same bugs." For the offensive side, the same harness pattern is available to anyone who can wire 100+ specialized agents around an open-weights model. Microsoft's claim that the system, not the model, is the moat is also a tacit admission that the technique is reproducible. MDASH itself is in limited private preview for customers.

For builders: if you maintain a meaningful codebase, the question is no longer whether agentic security tools will find your bugs but which ones will get to them first: your own pre-disclosure scanning or someone else's. Three concrete things to track: (1) CyberGym leaderboard movement, which is the only third-party measure with comparable systems; (2) whether Microsoft publishes the MDASH agent-debate transcripts the way DeepMind did for Big Sleep, since those are the actual reproducible artifact; (3) how MDASH-style tooling shows up in commodity Defender or in a separate SKU for Microsoft customers. The shift Kim flagged, "AI vulnerability findings can scale," is going to land hardest on small teams that can neither afford their own agentic harness nor afford to wait for one to be productized.