Blitzy raised $200M at a $1.4B valuation for what it's calling hyperscaled agent orchestration: building a dynamic knowledge graph of an enterprise codebase, then deploying thousands of agents in parallel against that representation. The headline numbers: 66.5% on SWE-Bench Pro (the harder variant, distinct from the SWE-Bench Verified subset most labs benchmark on), codebase support from 1 million to 100 million lines of code, and 100,000+ underlying model calls per run across Google, Anthropic, and OpenAI APIs. The platform claims dozens of Global 2000 enterprise customers across ten industries, though specific names weren't disclosed. Architecturally this is a meaningful departure from the single-agent-CLI category that dominates current builder mindshare: Cursor, Claude Code, and Aider each run a single agent doing sequential reasoning. Blitzy is doing fleet-scale orchestration with KG grounding.

The technical architecture worth flagging: reverse-engineering an existing codebase into a structured knowledge graph is the precondition for fleet-scale agent work. Without that representation, parallel agents step on each other (the same file edited twice) and lose context across the codebase. The KG lets the orchestrator partition work (agent A handles auth-related changes, agent B handles billing, and so on) without each agent needing the full codebase in its context. The 100,000-calls-per-run figure is the cost driver and the differentiator: most production coding agents make 50-200 calls per task. 100k calls means massive parallel exploration, candidate generation, voting, and verification rather than sequential chain-of-thought. SWE-Bench Pro at 66.5% on this approach is competitive with what Sonnet 4.5 + Claude Code achieves on SWE-Bench Verified (82% with parallel test-time compute), but Pro is harder, so the direct comparison isn't clean. What's not disclosed: how task decomposition actually works (rules-based, learned, hybrid?), what the latency per completion is (parallel runs ought to be fast wall-clock, but 100k API calls means real cost), the error modes when the KG is stale or wrong, and how the orchestrator handles agent disagreement.
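To make the partition-and-vote mechanic concrete, here's a minimal sketch of KG-grounded fan-out: split a dependency graph into disjoint file groups so parallel agents never collide, run several independent attempts per group, and keep the majority candidate. Everything here is illustrative (the toy graph, the `run_agent` stub, the component-based partition); this is one plausible shape for the technique, not Blitzy's actual orchestrator.

```python
import networkx as nx
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy knowledge graph: nodes are files, edges are import/call dependencies.
kg = nx.Graph()
kg.add_edges_from([
    ("auth/login.py", "auth/session.py"),
    ("billing/invoice.py", "billing/tax.py"),
])

def partition_work(graph: nx.Graph) -> list[set[str]]:
    """Split the codebase into disjoint file groups so parallel agents
    never edit the same file. Connected components are the simplest
    non-overlapping partition; a real orchestrator would use something
    smarter (community detection, task-aware cuts)."""
    return [set(c) for c in nx.connected_components(graph)]

def run_agent(files: set[str], attempt: int) -> str:
    """Stand-in for one agent run over one partition: returns a candidate
    patch. In a real system this is where the model calls happen."""
    return f"patch-for-{min(files)}"  # deterministic toy output

def best_patch(files: set[str], n_candidates: int = 5) -> str:
    """Fan out N independent attempts per partition, then keep the
    majority candidate: a crude form of self-consistency voting."""
    with ThreadPoolExecutor() as pool:
        candidates = list(pool.map(lambda i: run_agent(files, i),
                                   range(n_candidates)))
    return Counter(candidates).most_common(1)[0][0]

for part in partition_work(kg):
    print(part, "->", best_patch(part))
```

Note where the 100k calls come from in this shape: partitions x candidates x verification passes multiplies out fast, which is exactly why the call count dwarfs a sequential agent's 50-200.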

The ecosystem read: enterprise coding agents are bifurcating into two architectures. The single-agent IDE category (Cursor, Claude Code, Windsurf) optimizes for tight loops with a developer in the seat: fast iteration, frequent corrections, low per-task cost. Blitzy and the Devin/Cognition class optimize for "give us a spec, come back tomorrow": high per-task cost, no developer in the loop, but applicable to large refactors and feature builds that single-agent setups can't realistically tackle in a sitting. The 1M-100M LOC range Blitzy targets is the enterprise sweet spot: codebases too large for a single Cursor session to hold in context, where the KG-grounded approach has clear architectural justification. The 100k-API-call economics imply a unit cost in the hundreds to thousands of dollars per run, which only makes sense for substantive deliverables, not interactive editing. For builders evaluating agent platforms, the question becomes whether your work is "developer needs assistance" (single-agent) or "specification needs implementation" (fleet); these aren't substitutes, they're different categories.
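The hundreds-to-thousands range follows from simple arithmetic. Here's the back-of-envelope, with every token count and price an illustrative assumption (roughly current frontier-API rates), not a disclosed Blitzy figure:

```python
# Back-of-envelope unit economics for a 100k-call run. Every number
# below is an assumption for illustration; Blitzy has not disclosed
# token volumes or its blended API pricing.
calls_per_run = 100_000
avg_input_tokens = 2_000      # assumed context per call
avg_output_tokens = 500       # assumed generation per call
price_in_per_m = 3.00         # $/1M input tokens (rough frontier-API rate)
price_out_per_m = 15.00       # $/1M output tokens (rough frontier-API rate)

input_cost = calls_per_run * avg_input_tokens / 1e6 * price_in_per_m    # $600
output_cost = calls_per_run * avg_output_tokens / 1e6 * price_out_per_m  # $750
print(f"~${input_cost + output_cost:,.0f} per run")  # ~$1,350
```

Under these assumptions a run lands around $1,350; halve or triple the token volumes and you stay inside the hundreds-to-thousands band, which is why the economics only close for deliverables a team would otherwise spend days on.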

Practical move: if you run a Global-2000-class engineering org with codebases past 1M LOC and recurring large-refactor work, Blitzy is in the eval pool now. Demand specific evidence on the KG accuracy claim: a knowledge graph that misrepresents the codebase will silently corrupt parallel agent decisions in ways that are expensive to debug post hoc. Pin them on cost per completion at typical enterprise codebase scale, error rate in production deployments (not benchmark scores), and the orchestration determinism question (do two runs on the same input produce the same patch, or different ones?). For builders running smaller codebases, the single-agent Cursor/Claude Code setup is still the right tool: Blitzy's architecture pays for itself only when the parallelism advantage outweighs the orchestration overhead, which happens past a certain codebase size. SWE-Bench Pro 66.5% is interesting but not a portable signal until independent harnesses report it; this is one company's number, and eval-honesty discipline says watch for replication.
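The determinism question is cheap to test yourself during an eval. A minimal harness, assuming only that the vendor gives you some `run_platform(spec) -> patch` entry point (a hypothetical callable here, not a real Blitzy API): submit the same spec several times and diff the results.

```python
import difflib
import hashlib

def patch_fingerprint(patch: str) -> str:
    """Hash a whitespace-normalized patch so runs that differ only in
    trailing whitespace still compare equal."""
    normalized = "\n".join(line.rstrip() for line in patch.splitlines())
    return hashlib.sha256(normalized.encode()).hexdigest()[:12]

def determinism_check(run_platform, spec: str, trials: int = 3) -> None:
    """Submit the same spec N times and report whether the platform
    returns the same patch. `run_platform(spec) -> str` is whatever
    vendor entry point you are evaluating; no real endpoint is assumed."""
    patches = [run_platform(spec) for _ in range(trials)]
    fingerprints = {patch_fingerprint(p) for p in patches}
    if len(fingerprints) == 1:
        print("deterministic: all runs produced the same patch")
    else:
        print(f"{len(fingerprints)} distinct patches across {trials} runs")
        for a, b in zip(patches, patches[1:]):
            diff = difflib.unified_diff(a.splitlines(), b.splitlines(),
                                        lineterm="")
            print("\n".join(list(diff)[:20]))  # first 20 diff lines
```

Non-determinism isn't disqualifying (voting-based orchestration is stochastic by construction), but you want to know the variance before you wire the output into CI.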