SWE-bench Verified is contaminated: OpenAI Feb audit found 59% flawed cases, Zubnet AI News

On February 23, OpenAI's Frontier Evals team published a post explaining why it had stopped reporting SWE-bench Verified scores. The audit found that 59.4% of the hardest test cases in the benchmark have fundamental flaws: tests that demand exact function names not mentioned in the problem statement, or that check unrelated behavior. More damning, every major frontier model tested (GPT-5.2, Claude Opus 4.5, Gemini 3 Flash) could reproduce the gold-patch solutions verbatim from memory using only the task ID. OpenAI's conclusion was direct: "Improvements on SWE-bench Verified no longer reflect meaningful improvements in models' real-world software development abilities." The team recommends SWE-bench Pro instead. Three months later, the rest of the coding-agent industry is still ranking itself on the contaminated benchmark.

The top-of-the-table numbers currently being published are: Claude Code on Opus 4.7 at 87.6%, OpenAI Codex on GPT-5.5 at roughly 88.7% (from a third-party tracker; OpenAI does not self-report), Gemini CLI at 80.6%, OpenHands at 72%, Augment Code at 70.6% (self-reported on its own harness), Cursor around 51.7% on defaults, and GitHub Copilot around 56%. On SWE-bench Pro, the alternative OpenAI now recommends, the same models sit much lower: Claude Opus 4.7 at 64.3% and GPT-5.5 at 58.6%. Terminal-Bench 2.0 is the other benchmark that has stayed credible: Codex at 82.7%, Claude Code at 69.4%, Gemini CLI at 68.5%. The gap between the two benchmark families is itself the signal: when one eval's scores compress models against the ceiling and another's spread them out, the second one is doing the discrimination work.
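One rough way to see the difference in discrimination is to compute, for each benchmark, the range between the best and worst score and the headroom left below 100%. The sketch below uses only the figures quoted in this article for the three agents that appear on both lists. The `spread` helper and its two metrics are illustrative assumptions, not a method taken from OpenAI's post or from any tracker.

```python
# Scores (in percent) as quoted in this article, restricted to the
# three agents that are listed on both benchmarks.
scores = {
    "SWE-bench Verified": {"Claude Code": 87.6, "Codex": 88.7, "Gemini CLI": 80.6},
    "Terminal-Bench 2.0": {"Claude Code": 69.4, "Codex": 82.7, "Gemini CLI": 68.5},
}


def spread(results):
    """Return the best-to-worst range and the headroom below 100%.

    Both metrics are hypothetical summary statistics chosen for
    illustration; a small range with little headroom suggests the
    benchmark is compressing models against its ceiling.
    """
    values = list(results.values())
    return {
        "range": round(max(values) - min(values), 1),
        "headroom": round(100 - max(values), 1),
    }


for benchmark, results in scores.items():
    print(benchmark, spread(results))
```

On these numbers the three agents span about 8 points on SWE-bench Verified and about 14 points on Terminal-Bench 2.0, which is the compression-versus-spread pattern the paragraph describes.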

The deeper issue is the gap between benchmark-maximizing and productivity-maximizing. Agent scaffolding alone produces roughly ±17-problem variance on identical models, which means harness choices can dominate the model choice on any given run. None of the public rankings come with a published harness specification, so apples-to-apples comparisons across vendors are not actually being run — only apples-to-each-vendor's-own-numbers. The practical implication for builders is that the right comparison is not "which agent leads SWE-bench Verified" but "which agent solves my tasks on my codebase with my CI and my style conventions." The empirical method that works is running 50 to 100 tasks from your own real backlog against two or three candidates and measuring outcomes rather than scores.
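A backlog comparison of this kind can be recorded and summarized with very little code. The following is a minimal sketch, assuming each attempt is logged as a task ID, an agent name, and a yes/no outcome defined by your own CI and review process. The `Outcome` record, the function names, and the `agent_a`/`agent_b` labels are all hypothetical placeholders, not part of any vendor's tooling.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    task_id: str  # identifier of a task from your own backlog
    agent: str    # candidate name (placeholder labels below)
    solved: bool  # True only if the change passed your CI and review


def solve_rates(outcomes):
    """Fraction of attempted tasks that each agent solved."""
    attempts = defaultdict(int)
    solved = defaultdict(int)
    for o in outcomes:
        attempts[o.agent] += 1
        solved[o.agent] += int(o.solved)
    return {agent: solved[agent] / attempts[agent] for agent in attempts}


def head_to_head(outcomes, a, b):
    """Classify tasks attempted by both agents by who solved them.

    If an agent has several records for one task, the last one wins.
    """
    by_task = defaultdict(dict)
    for o in outcomes:
        by_task[o.task_id][o.agent] = o.solved
    counts = {"both": 0, "only_a": 0, "only_b": 0, "neither": 0}
    for results in by_task.values():
        if a not in results or b not in results:
            continue  # skip tasks that only one agent attempted
        if results[a] and results[b]:
            counts["both"] += 1
        elif results[a]:
            counts["only_a"] += 1
        elif results[b]:
            counts["only_b"] += 1
        else:
            counts["neither"] += 1
    return counts


# Toy data standing in for a real 50-to-100-task run.
demo = [
    Outcome("T-1", "agent_a", True), Outcome("T-1", "agent_b", True),
    Outcome("T-2", "agent_a", True), Outcome("T-2", "agent_b", False),
    Outcome("T-3", "agent_a", False), Outcome("T-3", "agent_b", False),
    Outcome("T-4", "agent_a", True), Outcome("T-4", "agent_b", False),
]

print(solve_rates(demo))
print(head_to_head(demo, "agent_a", "agent_b"))
```

The head-to-head counts are often more informative than the two solve rates alone, because they show which tasks one candidate handles and the other does not, on the same codebase and under the same harness.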

The recommendation pattern that actually fits the data is a layered stack rather than a single-tool bet. Terminal agents — Claude Code or Codex — earn their cost on multi-file refactors, architectural changes, and the kind of debugging that would otherwise burn a senior engineer's afternoon. IDE extensions — Cursor or GitHub Copilot — earn theirs on inline completions, quick edits, and ambient assistance during routine work. Open-source agents — Aider, Cline, OpenHands — earn theirs when you want to swap models, avoid platform markup, or audit agent behavior end-to-end. Using more than one is not indecision; it is the honest answer to specialization. On the benchmark side: SWE-bench Verified is not your friend any more. SWE-bench Pro, Terminal-Bench 2.0, and your own codebase are.
