Artificial Analysis published their independent eval of GPT-5.5 today, and the headline is the gap between vendor claims and third-party measurement. AA's Intelligence Index puts GPT-5.5 (xhigh) at the top by 3 points, breaking a three-way tie with prior frontier models: leading on Terminal-Bench Hard and GDPval-AA, trailing on CritPt, and second to Gemini 3.1 Pro on three benchmarks. On AA-Omniscience, their factual-knowledge benchmark, GPT-5.5 hits the highest accuracy at 57%, but with an 86% hallucination rate. Claude Opus 4.7 (max) sits at 36% hallucination on the same benchmark; Gemini 3.1 Pro at 50%. OpenAI's launch-day claim of a "60% hallucination drop" was measured on different terrain than what AA tests, and that gap is the read builders should care about.
The methodology distinction matters. OpenAI's hallucination evaluation appears to use prompts where ground truth is well-established and the model has training-data coverage; the "60% drop" measures improvement on a baseline OpenAI controls. AA-Omniscience targets the harder case: factual claims about obscure-but-verifiable subjects, where models tend to fabricate plausible-sounding answers because they don't know what they don't know. The 86%-vs-36% gap with Opus 4.7 isn't saying GPT-5.5 is "broadly worse" at facts; it's saying GPT-5.5 fabricates more confidently when pushed past its knowledge frontier. That's a recognized trade: higher accuracy on the easy tail can come with higher fabrication on the hard tail, especially when post-training rewards confident-sounding answers. AA's framework with extended-thinking modes shows the mechanism: GPT-5.5 Pro extended thinking halves its hallucination rate (8.3% → 4.2%, on an unspecified benchmark slice). Self-correction during reasoning is real, but it is not enabled by default in the Instant tier OpenAI just shipped.
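To make that distinction concrete, here's a minimal sketch of how an accuracy number and a hallucination rate can move independently; the scoring rule (hallucinations counted only over non-abstained answers) is my assumption for illustration, not AA's published harness.

```python
from dataclasses import dataclass

@dataclass
class Result:
    correct: bool    # answer matched the verified fact
    abstained: bool  # model declined to answer instead of guessing

def score(results: list[Result]) -> tuple[float, float]:
    """Accuracy over all prompts; hallucination rate over attempted answers only.

    Assumed definitions for illustration -- not necessarily AA's exact scoring.
    """
    attempted = [r for r in results if not r.abstained]
    accuracy = sum(r.correct for r in results) / len(results)
    hallucination_rate = (
        sum(not r.correct for r in attempted) / len(attempted) if attempted else 0.0
    )
    return accuracy, hallucination_rate

# A model that attempts nearly every prompt can top the accuracy column and
# still post a high hallucination rate: every miss becomes a confident
# fabrication rather than an abstention.
```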
The cost economics are the other half of the read. AA reports pricing doubled to $5/$30 per 1M input/output tokens for GPT-5.5 versus the prior 5.4 generation. Despite ~40% fewer output tokens on the same workload, running the AA Intelligence Index costs about 20% more on 5.5. The interesting angle: GPT-5.5 at medium reasoning effort matches Opus 4.7 performance at roughly one-quarter the cost (~$1,200 vs $4,800 for the Index run). For builders evaluating a router strategy (Opus for hard problems, GPT-5.5 medium for the rest), the economics now favor mixing more aggressively than they did under 5.4. The high-effort tier (xhigh) is where the leadership claim lives, but the medium-tier price/performance is the actual builder calculus. For ChatGPT consumers using the Instant default, none of this applies directly: Instant is positioned for latency, not extended reasoning, and the 86% AA-Omniscience number is on the xhigh tier, not Instant.
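A back-of-envelope cost model shows how "~40% fewer output tokens" and "about 20% more expensive" coexist; the token volumes below are placeholders rather than AA's measured workload, and the prior-generation prices assume 5.4 sat at exactly half of the new $5/$30.

```python
def run_cost(tokens_in: int, tokens_out: int, price_in: float, price_out: float) -> float:
    """Dollar cost of one eval run; prices are $ per 1M tokens."""
    return tokens_in / 1e6 * price_in + tokens_out / 1e6 * price_out

# Placeholder volumes, not AA's measured workload. Reasoning-heavy evals are
# output-dominated, which is what makes the headline arithmetic work out.
tokens_in, tokens_out_54 = 10_000_000, 50_000_000
tokens_out_55 = int(tokens_out_54 * 0.6)  # ~40% fewer output tokens

cost_54 = run_cost(tokens_in, tokens_out_54, 2.5, 15.0)  # assumes 5.4 at half price
cost_55 = run_cost(tokens_in, tokens_out_55, 5.0, 30.0)  # $5/$30 from the AA report
print(f"5.4 run: ${cost_54:,.0f}   5.5 run: ${cost_55:,.0f}   "
      f"delta: {cost_55 / cost_54 - 1:+.1%}")
# When output dominates the bill, 0.6x the tokens at 2x the price nets out
# around +20%, which is roughly the increase AA reports for the Index run.
```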
Practical move: if you're shipping factual-lookup or research-assistant flows, AA-Omniscience-style failures are the failure mode to test against, and the 50-point hallucination gap between GPT-5.5 and Opus 4.7 is large enough to matter for routing decisions. Build a small private eval set of obscure-but-verifiable factual queries (academic citations, niche technical specifications, historical specifics) and run both models, as sketched below; your domain-specific gap may differ from AA's overall number, but you'll know which side to route to. For coding and reasoning workloads, GPT-5.5 medium hitting Opus performance at a quarter of the cost is a real win; re-evaluate your routing if you've been defaulting to Opus for cost-insensitive deep tasks. The eval lesson holds beyond this release: vendor hallucination claims and independent benchmark hallucination rates measure different things, and "60% better" only means something specific to the harness it was measured on. Track both.
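A minimal version of that private eval might look like the sketch below; the model-call functions are hypothetical stand-ins for your own API wrappers, and the substring grader is a deliberately naive placeholder for whatever verification you actually trust.

```python
# Sketch of a small factual-recall eval. Everything named here is illustrative:
# swap in your own prompts, ground truths, model clients, and grading.
EVAL_SET = [
    # (prompt, verified ground-truth answer) -- fill with 50-100 items
    # from your own domain: citations, part numbers, dates, spec values.
    ("<obscure but verifiable question>", "<verified answer>"),
]

ABSTAIN_MARKERS = ("i don't know", "i'm not sure", "cannot verify")

def grade(answer: str, truth: str) -> str:
    """Naive grader: abstention check, then substring match against ground truth."""
    a = answer.lower()
    if any(marker in a for marker in ABSTAIN_MARKERS):
        return "abstained"
    return "correct" if truth.lower() in a else "hallucinated"

def evaluate(ask) -> dict[str, int]:
    """Run one model over the eval set; `ask` maps a prompt string to an answer string."""
    counts = {"correct": 0, "abstained": 0, "hallucinated": 0}
    for prompt, truth in EVAL_SET:
        counts[grade(ask(prompt), truth)] += 1
    return counts

# Hypothetical usage -- ask_gpt55_medium and ask_opus47 are your own wrappers:
# for name, ask in [("gpt-5.5 medium", ask_gpt55_medium), ("opus-4.7", ask_opus47)]:
#     print(name, evaluate(ask))
```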
