Microsoft MDASH ships a 100+ agent pipeline for code audit, scores 88.45% on CyberGym

Microsoft introduced MDASH this week: a multi-model agentic platform that uses more than 100 specialized AI agents, organized as a multi-stage pipeline, to audit code at scale. Note the framing: not "finding AI vulnerabilities" (the InfoQ headline is misleading) but "using AI agents to find traditional code vulnerabilities." The stages are scanning, debate, validation, deduplication, and exploitation, each handled by a separate specialized agent role rather than by a single agent doing everything sequentially. It is model-agnostic by design. For builders working on autonomous SecOps or agent-driven code review, the architectural taxonomy is the takeaway, regardless of access to the platform itself.

The numbers are concrete. On CyberGym, a public benchmark of 1,507 real-world vulnerabilities, MDASH scores **88.45%**. On Microsoft's internal historical case sets: **96% recall** on Windows Common Log File System driver (`clfs.sys`) vulnerabilities and **100% recall** on TCP/IP stack (`tcpip.sys`) cases. The codebases tested are Windows, Hyper-V, and Azure: large, mature, heavily audited C and C++ systems where novel CVEs are hard to find and recall is the meaningful metric. The honestly reported limitation is orchestration risk, specifically "the blast radius of a single misconfigured permission boundary." With 100+ agents talking to each other and to source trees, that is a lot of trust boundary to keep clean, and the platform acknowledges it.
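For readers less familiar with the metric, recall on a historical case set is the share of known vulnerabilities that the tool rediscovers. The sketch below is a minimal illustration; the case IDs and set sizes are made up for the example and are not Microsoft's data.

```python
from typing import Set


def recall(reported: Set[str], known: Set[str]) -> float:
    """Fraction of known vulnerabilities that appear among the reported findings."""
    if not known:
        raise ValueError("known case set must not be empty")
    return len(reported & known) / len(known)


# Hypothetical case set: 25 known bugs, the tool rediscovers 24 of them
# and also reports one finding that is not in the historical set.
known_cases = {f"CASE-{i}" for i in range(25)}
reported_cases = {f"CASE-{i}" for i in range(24)} | {"NEW-1"}

score = recall(reported_cases, known_cases)
print(f"recall = {score:.2%}")
```

Note that the extra finding `NEW-1` does not affect recall. Recall only measures misses against the known set, which is why it suits heavily audited code where the historical bugs are well documented.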

Ecosystem read: this is the agent-infrastructure thesis applied to security research, and the architectural pattern is reproducible without access to Microsoft's specific implementation. The "one specialized agent per pipeline stage" split (scan, debate, validate, dedupe, exploit) is a useful template for any multi-agent system that hits the single-agent context-explosion ceiling. It pairs naturally with this month's other agent infra: Google Genkit middleware's 3-hook-point composition, Tencent's 4-tier memory pyramid, and the earlier Dreadnode red-team agent work. The shape is the same: break the agent loop into composable specialized stages instead of one monolithic loop that tries to do everything. Microsoft's specific contribution is the explicit *debate* and *validation* stages, which most published agent harnesses collapse into one.

Monday morning: MDASH itself is in "limited private preview with internal Microsoft + selected customers," with no GitHub repo, no license, and no public access for most builders. What is usable today is the architectural template. If you are building an autonomous code-audit agent, your minimum viable pipeline should be: a scanner agent that proposes candidates, a debate agent that argues both sides of each candidate, a validation agent that runs concrete tests, a dedupe agent that merges semantically equivalent findings, and an exploitation agent that produces PoCs for the survivors. 88.45% on CyberGym is the bar to beat if you reproduce it. The honest unknowns: which underlying models Microsoft uses inside MDASH, what the per-stage success rates are individually, and how much of the headline number comes from the pipeline versus model strength.
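The five-stage shape can be sketched in a few lines of Python. This is only a toy under stated assumptions, not MDASH's implementation: every function body is a hypothetical string heuristic standing in for an LLM-backed agent, and the `Finding` type, the `memcpy` rule, and the `user_len` name are all invented for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, List


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    kind: str       # hypothetical label, e.g. "unchecked-memcpy"
    evidence: str   # the source line that triggered the candidate
    poc: str = ""


def scan(source: Dict[str, str]) -> List[Finding]:
    """Scanner stand-in: propose a candidate for every memcpy call."""
    out = []
    for path, text in source.items():
        for n, line in enumerate(text.splitlines(), start=1):
            if "memcpy(" in line:
                out.append(Finding(path, n, "unchecked-memcpy", line.strip()))
    return out


def debate(findings: List[Finding]) -> List[Finding]:
    """Debate stand-in: the 'against' side wins if the length is bounded by sizeof."""
    return [f for f in findings if "sizeof(" not in f.evidence]


def validate(findings: List[Finding]) -> List[Finding]:
    """Validation stand-in for a concrete test: keep only attacker-controlled lengths."""
    return [f for f in findings if "user_len" in f.evidence]


def dedupe(findings: List[Finding]) -> List[Finding]:
    """Dedupe stand-in: treat same file and same kind as equivalent, keep the first."""
    seen, out = set(), []
    for f in findings:
        key = (f.file, f.kind)
        if key not in seen:
            seen.add(key)
            out.append(f)
    return out


def exploit(findings: List[Finding]) -> List[Finding]:
    """Exploitation stand-in: attach a placeholder PoC description to each survivor."""
    return [replace(f, poc=f"oversized user_len at {f.file}:{f.line}") for f in findings]


def run_pipeline(source: Dict[str, str]) -> List[Finding]:
    findings = scan(source)
    for stage in (debate, validate, dedupe, exploit):
        findings = stage(findings)
    return findings


SOURCE = {
    "a.c": "memcpy(dst, src, user_len);\n"
           "memcpy(dst, src, sizeof(dst));\n"
           "memcpy(out, in, user_len);\n",
    "b.c": "memcpy(buf, pkt, n);\n",
}

report = run_pipeline(SOURCE)
for f in report:
    print(f.file, f.line, f.kind, "|", f.poc)
```

Because every stage after the scanner shares one signature (a list of findings in, a list of findings out), a stage can be swapped, reordered, or measured on its own, which is also what you would need in order to report per-stage success rates.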
