Datacurve DeepSWE: GPT-5.5 70%, Claude 4.7 54%, Gemini 3.1 Pro 10% — read the harness

Datacurve released DeepSWE, a long-horizon software engineering benchmark with 113 tasks across 91 repositories in 5 languages. Top scores reported: GPT-5.5 at 70%, GPT-5.4 at 56%, Claude Opus 4.7 at 54%, Gemini 3.1 Pro at 10%. The headline reads as "GPT-5.5 wins." The interesting story for builders is in the methodology page, not the leaderboard.

The benchmark states four advances. Tasks are written from scratch rather than adapted from existing PRs or commits, with a deep-swe-canary GUID embedded so contamination can be detected if the corpus leaks into pretraining. Coverage spans 91 repos and 5 languages. Prompts are roughly half the length of SWE-bench Pro's, yet solutions require 5.5x more code and ~2x more output tokens. Verifiers are hand-written and test software behavior rather than implementation details. All models run through mini-swe-agent as a common scaffold. The task examples are non-trivial: "Add XML diff, patch, and merge operations to etree," "Add trap coredump generation to wasmi," "Fix PromQL label sorting across typed and untyped values." This is work that took engineers hours before the agentic era. Reasoning-budget tiers were asymmetric in the comparison: GPT-5.5 ran at xhigh, Claude Opus 4.7 at max, and Gemini 3.1 Pro at an unlabeled tier.
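To make the canary idea concrete, here is a minimal sketch of how a team might scan a local text corpus for the marker. The GUID value itself is not given here, so the sketch matches only on the marker name "deep-swe-canary"; the function name and the directory layout are assumptions for illustration, not part of Datacurve's tooling.

```python
from pathlib import Path

# Assumption: we match on the marker name only. The actual GUID value is not
# stated in this article, so it is not reproduced here.
CANARY_MARKER = "deep-swe-canary"


def find_canary_hits(root, marker=CANARY_MARKER):
    """Return paths of files under `root` whose text contains `marker`."""
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            # errors="ignore" lets the scan pass over binary or mis-encoded files
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it rather than abort the scan
        if marker in text:
            hits.append(str(path))
    return hits
```

Any hit in a pretraining or fine-tuning corpus would be a signal that benchmark tasks may have leaked into it. A real pipeline would stream large files instead of reading them whole.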

Two builder-relevant reads. First: the 60-point spread between GPT-5.5 and Gemini 3.1 Pro is large enough to suspect a structural bias in the benchmark toward one model's tool-use idiom, especially on a new eval where harness conventions matter. SWE-bench Verified scores narrowed once the field had time to rerun on multiple scaffolds; DeepSWE will likely follow the same arc. Second: Datacurve is in the data-services business, so a benchmark that ranks foundation models is also an advertisement for the company that built it. That does not invalidate the eval, but the leaderboard needs independent re-execution before anyone treats it as load-bearing. mini-swe-agent is only one scaffold; OpenHands, Aider, and Claude Code-style harnesses are likely to produce different relative orderings on the same tasks.

If you ship code-using agents Monday morning: read the methodology section of any new SWE benchmark before treating its numbers as a ranking. Look for a canary GUID, scaffold disclosure, reasoning-budget normalization, and whether the eval lives in a Docker container you can run yourself. Bet on the methodology trend, not the leaderboard headline.
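The four checks above can be sketched as a small checklist structure. This is an illustrative sketch only: the class, field names, and labels are hypothetical, and the example values for DeepSWE reflect just what this article reports (the container question is left as unknown because the article does not settle it).

```python
from dataclasses import dataclass, fields
from typing import Optional

# Human-readable labels for each check; names are illustrative only.
LABELS = {
    "canary_guid": "canary GUID",
    "scaffold_disclosed": "scaffold disclosure",
    "reasoning_budget_normalized": "reasoning-budget normalization",
    "runnable_container": "container you can run yourself",
}


@dataclass
class BenchmarkDisclosure:
    # True = confirmed, False = confirmed absent, None = not yet known
    canary_guid: Optional[bool] = None
    scaffold_disclosed: Optional[bool] = None
    reasoning_budget_normalized: Optional[bool] = None
    runnable_container: Optional[bool] = None


def open_questions(disclosure):
    """List (label, status) for every check that is not confirmed True."""
    out = []
    for f in fields(disclosure):
        value = getattr(disclosure, f.name)
        if value is not True:
            out.append((LABELS[f.name], "no" if value is False else "unknown"))
    return out


# Example values as read from this article, not from Datacurve's own docs.
deepswe = BenchmarkDisclosure(
    canary_guid=True,                   # deep-swe-canary GUID is embedded
    scaffold_disclosed=True,            # mini-swe-agent is named
    reasoning_budget_normalized=False,  # xhigh vs max vs unlabeled
    runnable_container=None,            # not stated in this article
)
print(open_questions(deepswe))
```

Anything the function returns is a question to resolve before a leaderboard number feeds a model-selection decision.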
