OpenAI shipped GPT-5.5 today, seven days after GPT-5.4 — the fastest release cadence the company has run in the 5-series. Greg Brockman framed it as a step toward the "superapp" thesis OpenAI has been telegraphing since last quarter, and VP of Research Amelia Glaese called it "our strongest model yet on coding." The model ships to ChatGPT Plus, Pro, Business, and Enterprise immediately, with GPT-5.5 Pro going to the top three tiers. Axios reports the internal codename is "Spud."

The headline numbers are Terminal-Bench 2.0 at 82.7% (up from GPT-5.4's 75.1%) and the Expert-SWE internal coding eval at 73.1% (up from 68.5%). VentureBeat caught the most interesting comparison: on Terminal-Bench 2.0 specifically, GPT-5.5 narrowly beats Anthropic's Mythos Preview. That is notable because Mythos is a restricted research preview Anthropic has not made generally available, while GPT-5.5 ships to ChatGPT users today. The claim that matters most for serving economics is in OpenAI's release notes: GPT-5.5 matches GPT-5.4's per-token latency while completing tasks with fewer tokens. If that holds on production workloads, it is a straight cost-per-completion improvement at the same throughput ceiling.
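
The arithmetic behind that claim is simple: at the same per-token price and the same per-token latency, both cost and wall-clock time per completed task scale directly with tokens used. A minimal sketch with hypothetical figures (none of these prices, latencies, or token counts come from OpenAI):

```python
def per_task_cost_and_time(tokens_per_task: float,
                           price_per_mtok: float,
                           latency_s_per_token: float) -> tuple[float, float]:
    """Cost (USD) and wall-clock time (s) for one completed task."""
    cost = tokens_per_task * price_per_mtok / 1_000_000
    time_s = tokens_per_task * latency_s_per_token
    return cost, time_s

# Hypothetical: same price and per-token latency, fewer tokens per task.
old_cost, old_time = per_task_cost_and_time(12_000, 10.0, 0.02)
new_cost, new_time = per_task_cost_and_time(9_000, 10.0, 0.02)

print(f"old: ${old_cost:.2f} and {old_time:.0f}s per task")
print(f"new: ${new_cost:.2f} and {new_time:.0f}s per task")
# 25% fewer tokens -> 25% lower cost AND 25% less wall-clock per task,
# with no change to the per-token throughput ceiling.
```

The point of the sketch: because latency is unchanged per token, the token reduction is the whole story, and it compounds across every task in an agentic pipeline.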

The cadence is the pattern. GPT-5.4 shipped April 16, the same day Anthropic's Opus 4.7 went generally available. GPT-5.5 arrives April 23, one week later, narrowly beating Mythos on one benchmark, against a model that is itself not generally available. The release tempo that used to be measured in months is now measured in weeks, and each release lands with selective benchmarks positioned against whichever competitor shipped most recently. For anyone building on OpenAI, the velocity cuts both ways: new capabilities arrive faster, and the model you built against two weeks ago may no longer be the default when your users hit it.

Three concrete notes for builders. One, if you ship agentic workflows on ChatGPT or the API, the per-token efficiency claim is the lever worth testing against your own workload first; Terminal-Bench 2.0 and Expert-SWE are benchmarks, not your workload. Two, the "coding and tool use end-to-end" framing in OpenAI's release (writing and debugging code, researching online, analyzing data, creating documents and spreadsheets, operating software, moving across tools until a task is finished) matches the convergence toward the Claude Code/Gemini CLI/Cursor feature surface we have been tracking all month. Three, versioning discipline now matters more: GPT-5.4 to GPT-5.5 is a seven-day delta, so pin the model string you depend on.
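
Notes one and three fit in one harness: pin both model strings explicitly, run your own tasks through each, and compare completion-token usage. A hedged sketch, assuming the model identifiers shown are placeholders (verify the real strings against your account's model list) and that `run_task` wraps your actual API calls, summing completion tokens from the responses' usage fields:

```python
from statistics import mean
from typing import Callable

# Pinned, explicit model strings -- never a "latest" alias or the default.
# These identifiers are placeholders; check your provider's model list.
PINNED_BASELINE = "gpt-5.4"
PINNED_CANDIDATE = "gpt-5.5"

def compare_token_usage(run_task: Callable[[str, str], int],
                        tasks: list[str]) -> float:
    """Ratio of candidate to baseline completion tokens on YOUR tasks.

    run_task(model, task) should execute one workflow end-to-end against
    the pinned model and return its total completion-token count. A ratio
    below 1.0 means the candidate finishes your workload in fewer tokens.
    """
    baseline = mean(run_task(PINNED_BASELINE, t) for t in tasks)
    candidate = mean(run_task(PINNED_CANDIDATE, t) for t in tasks)
    return candidate / baseline

# Stub runner for illustration only; swap in real API calls.
fake_usage = {"gpt-5.4": 1200, "gpt-5.5": 900}
ratio = compare_token_usage(lambda model, task: fake_usage[model],
                            ["task-a", "task-b"])
print(f"candidate uses {ratio:.0%} of baseline tokens on this workload")
```

The ratio, multiplied by your current per-task token spend, is the cost-per-completion delta on the workload you actually run, which is the number the benchmark tables cannot give you.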