GPT-5.5 tops the AA Intelligence Index; 86% hallucination rate on AA-Omniscience

Artificial Analysis published its independent eval of GPT-5.5 today, and the headline is the gap between vendor claims and third-party measurement. The AA Intelligence Index puts GPT-5.5 (xhigh) on top by 3 points, breaking a three-way tie with previous frontier models: it leads on Terminal-Bench Hard and GDPval-AA, trails on CritPt, and comes second to Gemini 3.1 Pro on three benchmarks. On AA-Omniscience, AA's factual-knowledge benchmark, GPT-5.5 posts the highest accuracy at 57%, but with an 86% hallucination rate. Claude Opus 4.7 (max) sits at 36% hallucination on the same benchmark; Gemini 3.1 Pro is at 50%. OpenAI's launch-day "60% hallucination drop" claim was measured on different terrain from what AA tests, and that gap is the reading that should matter to builders.

The methodology distinction matters. OpenAI's hallucination evaluation apparently uses prompts where ground truth is well established and the model has training-data coverage, so the "60% drop" measures improvement against a baseline that OpenAI controls. AA-Omniscience targets the harder case: factual claims about obscure-but-verifiable subjects, where models fabricate plausible-sounding answers because they do not know what they do not know. The 86%-vs-36% gap with Opus 4.7 does not say GPT-5.5 is "broadly worse" on facts; it says GPT-5.5 fabricates more confidently when pushed past its knowledge frontier. This is a recognized trade: higher accuracy on the easy tail can come with higher fabrication on the hard tail, especially when post-training rewards confident-sounding answers. AA's framework for extended-thinking modes shows the mechanism: GPT-5.5 Pro with extended thinking halves its hallucination rate (8.3% → 4.2% on a benchmark slice; which one is not specified). Self-correction during reasoning is real, but it is not enabled by default in the Instant tier OpenAI shipped.

Cost economics is the other half of the reading. AA reports that pricing doubled against the previous 5.4 generation, to $5/$30 per 1M input/output tokens. Despite ~40% fewer output tokens on the same workload, running the AA Intelligence Index costs about 20% more on 5.5. The interesting angle: GPT-5.5 at medium reasoning effort matches Opus 4.7 performance at roughly a quarter of the cost (~$1,200 vs $4,800 for an Index run). For builders evaluating a routing strategy (Opus for hard problems, GPT-5.5 medium for everything else), the economics now favor mixing more aggressively than they did with 5.4. The high-effort tier (xhigh) is where the leadership claim lives, but medium-tier price/performance is the actual builder calculus. For ChatGPT consumers on the default Instant tier, none of this applies directly: Instant is positioned for latency, not extended reasoning, and the 86% AA-Omniscience number is for the xhigh tier, not Instant.
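The cost figures above can be checked with a few lines of arithmetic. This is a rough sketch, not AA's methodology: it assumes the Index run cost is dominated by output tokens (so a doubled price times ~60% of the tokens gives ~1.2x), and `blended_cost` is a hypothetical helper that assumes your own workload scales like an Index run.

```python
# Figures quoted in the article.
PRICE_MULTIPLIER = 2.0      # pricing doubled vs. the 5.4 generation
OUTPUT_TOKEN_RATIO = 0.60   # ~40% fewer output tokens on the same workload
GPT55_MEDIUM_RUN = 1200.0   # approx. USD for an Index run, GPT-5.5 medium
OPUS47_RUN = 4800.0         # approx. USD for an Index run, Opus 4.7

# Assumption: output tokens dominate the bill, so the relative cost is
# simply (price multiplier) x (token ratio).
relative_cost = PRICE_MULTIPLIER * OUTPUT_TOKEN_RATIO


def blended_cost(hard_fraction: float) -> float:
    """Hypothetical routing cost for one Index-sized workload.

    hard_fraction is the share of the workload sent to Opus 4.7; the
    remainder goes to GPT-5.5 medium. Assumes cost splits linearly.
    """
    if not 0.0 <= hard_fraction <= 1.0:
        raise ValueError("hard_fraction must be between 0 and 1")
    return hard_fraction * OPUS47_RUN + (1.0 - hard_fraction) * GPT55_MEDIUM_RUN


if __name__ == "__main__":
    print(f"5.5 vs 5.4 relative cost: {relative_cost:.2f}x")
    print(f"medium / Opus cost ratio: {GPT55_MEDIUM_RUN / OPUS47_RUN:.2f}")
    for share in (0.0, 0.2, 0.5, 1.0):
        print(f"{share:.0%} routed to Opus: ${blended_cost(share):,.0f}")
```

Under these assumptions, the doubled price and the ~40% token reduction line up with the reported ~20% increase, and the blended figure shows how quickly the bill grows as more traffic is routed to the expensive model.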

The practical move: if you are shipping factual-lookup or research-assistant flows, AA-Omniscience-style failures are the failure mode to test against, and the 50-point gap between GPT-5.5 and Opus 4.7 is large enough to matter for routing decisions. Build a small private eval set of obscure-but-verifiable factual queries (academic citations, niche technical specifications, historical specifics) and run both models; your domain-specific gap may differ from AA's overall number, but you will know which way to route. For coding and reasoning workloads, GPT-5.5 medium hitting Opus performance at a quarter of the cost is a real win, so re-evaluate your routing if you were defaulting to Opus for cost-insensitive deep tasks. The eval lesson goes beyond this release: vendor hallucination claims and independent benchmark hallucination rates measure different things, and "60% better" means something specific only for the harness it was measured on. Track both.
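A private eval of this kind can start very small. The sketch below is illustrative only: the items are placeholders, the two "models" are stub functions standing in for real API calls, and the grading is naive substring matching. The hallucination rate here is defined as wrong answers divided by all non-correct responses (wrong plus abstained). That is one common definition, and whether it matches AA's exact scoring is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Phrases treated as an abstention (hypothetical list; tune for your models).
ABSTAIN_MARKERS = ("i don't know", "i do not know", "not sure", "cannot verify")


@dataclass
class EvalItem:
    question: str
    answer: str  # ground truth you have verified yourself


def grade(response: str, truth: str) -> str:
    """Label a response as 'correct', 'incorrect', or 'abstained'."""
    text = response.strip().lower()
    if any(marker in text for marker in ABSTAIN_MARKERS):
        return "abstained"
    return "correct" if truth.strip().lower() in text else "incorrect"


def evaluate(model: Callable[[str], str], items: List[EvalItem]) -> Dict[str, float]:
    """Run every item through the model and compute the two headline metrics."""
    counts = {"correct": 0, "incorrect": 0, "abstained": 0}
    for item in items:
        counts[grade(model(item.question), item.answer)] += 1
    total = len(items)
    not_correct = counts["incorrect"] + counts["abstained"]
    return {
        "accuracy": counts["correct"] / total if total else 0.0,
        # Assumed definition: share of non-correct responses that were
        # confident wrong answers rather than abstentions.
        "hallucination_rate": counts["incorrect"] / not_correct if not_correct else 0.0,
    }


# Placeholder items and stub models; replace with real queries and API calls.
ITEMS = [EvalItem("q1", "alpha"), EvalItem("q2", "beta"),
         EvalItem("q3", "gamma"), EvalItem("q4", "delta")]


def confident_model(question: str) -> str:
    known = {"q1": "alpha", "q2": "beta"}
    return known.get(question, "omega")  # fabricates when it does not know


def cautious_model(question: str) -> str:
    known = {"q1": "alpha"}
    return known.get(question, "I don't know")  # abstains when it does not know


if __name__ == "__main__":
    for name, model in (("confident", confident_model), ("cautious", cautious_model)):
        print(name, evaluate(model, ITEMS))
```

The stubs reproduce the trade described earlier in miniature: the confident stub scores higher accuracy but fabricates on everything it does not know, while the cautious stub scores lower accuracy with no fabrication. For real use, substring grading would likely need to be replaced with human review or a stricter matcher.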

