Blitzy ships hyperscaled coding agents: 66.5% on SWE-Bench Pro, 100k calls per run

Blitzy has raised $200M at a $1.4B valuation for what it calls hyperscaled agent orchestration: building a dynamic knowledge graph (KG) of an enterprise codebase, then deploying thousands of agents in parallel against that representation. The headline numbers: 66.5% on SWE-Bench Pro (the harder variant, distinct from the SWE-Bench Verified subset that most labs benchmark on), support for codebases from 1 million to 100 million lines of code, and 100,000+ underlying model calls per run across Google, Anthropic, and OpenAI APIs. The platform claims dozens of Global 2000 enterprise customers across ten industries, though no specific names have been disclosed. Architecturally, this is a meaningful departure from the single-agent-CLI category that dominates current builder mindshare: Cursor, Claude Code, and Aider all run a single agent that reasons sequentially. Blitzy does fleet-scale orchestration with KG grounding.

The technical architecture worth flagging: reverse-engineering an existing codebase into a structured knowledge graph is the precondition for fleet-scale agent work. Without that representation, parallel agents step on each other (the same file gets edited twice) and lose context across the codebase. The KG lets the orchestrator partition work (agent A handles auth-related changes, agent B handles billing, and so on) without each agent needing the whole codebase in its context. The 100,000 model calls per run are both the cost driver and the differentiator: most production coding agents make 50-200 calls per task. 100k calls means massive parallel exploration, candidate generation, voting, and verification, not sequential chain-of-thought. The 66.5% this approach scores on SWE-Bench Pro is competitive with what Sonnet 4.5 + Claude Code achieves on SWE-Bench Verified (82% with parallel test-time compute), but Pro is harder, so the direct comparison is not clean. What has not been disclosed: how task decomposition actually works (rules-based, learned, hybrid?), what latency-per-completion looks like (parallel runs should be fast in wall-clock terms, but 100k API calls mean real cost), the failure modes when the KG is stale or wrong, and how the orchestrator handles agent disagreement.
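Since Blitzy has not disclosed how its decomposition works, the following is only a minimal sketch of the general idea of KG-based work partitioning, not the company's method. It assumes a hypothetical `depends_on` mapping standing in for knowledge-graph edges between tightly coupled files, and a hypothetical `tasks` mapping from task IDs to the (non-empty) set of files each task edits. Files connected by an edge, or touched by the same task, are merged with a union-find structure, so that each resulting group can go to one agent and no file is edited by two groups.

```python
from collections import defaultdict


def partition_tasks(tasks, depends_on):
    """Group tasks so that no file is edited by more than one group.

    tasks:      dict of task_id -> non-empty set of files the task edits
    depends_on: dict of file -> set of coupled files (hypothetical KG edges)
    Returns a sorted list of sorted task-id lists, one list per agent.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Files joined by a KG edge must land with the same agent.
    for file, neighbours in depends_on.items():
        for neighbour in neighbours:
            union(file, neighbour)

    # Files touched by the same task must land together too.
    for files in tasks.values():
        ordered = sorted(files)
        for other in ordered[1:]:
            union(ordered[0], other)

    # Every file of a task now shares one root, so any file identifies the group.
    groups = defaultdict(list)
    for task_id, files in tasks.items():
        groups[find(sorted(files)[0])].append(task_id)
    return sorted(sorted(group) for group in groups.values())


# Illustrative input: two auth tasks coupled through a KG edge,
# two billing tasks coupled through a shared file.
example_tasks = {
    "t1": {"auth/login.py"},
    "t2": {"auth/session.py"},
    "t3": {"billing/invoice.py"},
    "t4": {"billing/invoice.py", "billing/tax.py"},
}
example_edges = {"auth/login.py": {"auth/session.py"}}
example_groups = partition_tasks(example_tasks, example_edges)
```

The sketch also shows why KG accuracy matters: a missing edge in `depends_on` would split coupled files across agents, and nothing in the partitioning step would notice.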

The ecosystem reading: enterprise coding agents are bifurcating into two architectures. The single-agent-IDE category (Cursor, Claude Code, Windsurf) optimizes for tight loops with a developer in the seat: fast iteration, frequent corrections, low per-task cost. Blitzy and the Devin/Cognition class optimize for "give us a spec, come back tomorrow": high per-task cost and no developer in the loop, but applicable to large refactors and feature builds that single-agent setups cannot realistically tackle in one session. The 1M-100M LOC range Blitzy targets is the enterprise sweet spot: codebases too large for a single Cursor session to hold in context, where the KG-grounded approach has a clear architectural justification. The 100k-API-call economics imply a unit cost in the hundreds to thousands of dollars per run, which only makes sense for substantive deliverables, not interactive editing. For builders evaluating agent platforms, the question becomes whether your work is "developer needs assistance" (single-agent) or "specification needs implementation" (fleet). These are not substitutes; they are different categories.
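The "hundreds to thousands of dollars per run" estimate can be checked with back-of-envelope arithmetic. The sketch below is illustrative only: the token counts per call and the per-million-token prices are assumptions chosen for the example, not figures disclosed by Blitzy or quoted from any provider's price list.

```python
def run_cost_usd(calls, in_tokens, out_tokens, usd_per_m_in, usd_per_m_out):
    """Estimated cost of one run: calls x (input cost + output cost per call)."""
    per_call = (in_tokens * usd_per_m_in + out_tokens * usd_per_m_out) / 1_000_000
    return calls * per_call


# Assumed workload: 100,000 calls, 2,000 input and 300 output tokens per call.
# Assumed prices: a cheaper tier ($1 / $5 per million tokens in/out)
# and a pricier tier ($3 / $15 per million tokens in/out).
low_estimate = run_cost_usd(100_000, 2_000, 300, 1.0, 5.0)
high_estimate = run_cost_usd(100_000, 2_000, 300, 3.0, 15.0)
```

Under these assumptions the two tiers come out at roughly $350 and $1,050 per run, which is consistent with the order of magnitude above. Longer contexts per call would push the figure up quickly, since input tokens dominate the per-call cost here.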

The practical move: if you run a Global-2000-class engineering org with codebases above 1M LOC and recurring large-refactor work, Blitzy is now in the eval pool. Ask for specific evidence on the KG accuracy claim: a knowledge graph that misrepresents the codebase will silently corrupt parallel agent decisions in ways that are expensive to debug post hoc. Pin them down on cost-per-completion at typical enterprise codebase scale, on error rate in production deployments (not benchmark scores), and on the orchestration determinism question (do two runs on the same input produce the same patch or different ones?). For builders running smaller codebases, a single-agent Cursor/Claude Code setup is still the right tool: Blitzy's architecture does not pay for itself until the parallelism advantage outweighs the orchestration overhead, which only kicks in past a certain codebase size. The 66.5% on SWE-Bench Pro is interesting, but it is not a portable signal until independent harnesses report it; this is one company's number, and eval-honesty discipline says to wait for replication.
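The determinism question can be tested during an eval without any vendor cooperation: run the same input several times, collect the resulting patches, and compare them. The sketch below is one simple way to do that, assuming the patches are available as plain text strings. The helper names `patch_fingerprint` and `is_deterministic` are hypothetical, and the normalization (line endings and trailing whitespace only) is a deliberate simplification.

```python
import hashlib


def patch_fingerprint(patch: str) -> str:
    """Hash a patch after normalizing line endings and trailing whitespace."""
    lines = [line.rstrip() for line in patch.replace("\r\n", "\n").split("\n")]
    normalized = "\n".join(lines).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def is_deterministic(patches) -> bool:
    """True only if every patch in a non-empty collection has the same fingerprint."""
    return len({patch_fingerprint(p) for p in patches}) == 1


# Two runs that differ only in whitespace and line endings count as identical.
run_a = "--- a/app.py\n+++ b/app.py\n-x = 1\n+x = 2\n"
run_b = "--- a/app.py\r\n+++ b/app.py\r\n-x = 1  \r\n+x = 2\r\n"
# A third run that makes a different edit does not.
run_c = "--- a/app.py\n+++ b/app.py\n-x = 1\n+x = 3\n"
```

Byte-level equality is a strict bar: two patches can differ textually and still be functionally equivalent, so a failed check is a prompt to compare test results across runs rather than proof of a problem.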
