François Chollet's ARC Prize Foundation just released ARC-AGI-3, and it's brutal. The new version of the interactive reasoning benchmark, which humans solve 100% of the time, has dropped every frontier AI model below 1%. Google's Gemini Pro leads the humbling scoreboard at 0.37%, followed by GPT 5.4 High at 0.26%, Claude Opus at 0.25%, and Grok at a flat zero. These are game-like scenarios with zero instructions: models must discover the rules, form goals, and execute strategies entirely from scratch.
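
To make that setup concrete, here is a minimal, hypothetical sketch of an interactive evaluation loop: the agent receives only raw observations and a list of legal actions, and has to infer the rules on its own. The `ToyEnvironment` and `RandomAgent` names are invented for illustration and are not the actual ARC-AGI-3 API or scoring harness.

```python
import random
from dataclasses import dataclass


@dataclass
class ToyEnvironment:
    """Toy stand-in for one game: reach cell `goal` on a 1-D track of `size` cells.
    Hypothetical interface, not the real ARC-AGI-3 environment."""
    size: int = 10
    goal: int = 7
    pos: int = 0
    actions: tuple = ("LEFT", "RIGHT")  # the agent sees legal actions, never the rules

    def step(self, action: str):
        # Hidden rule the agent must discover: RIGHT moves +1, LEFT moves -1, clamped to the track.
        self.pos += 1 if action == "RIGHT" else -1
        self.pos = max(0, min(self.size - 1, self.pos))
        return self.pos, self.pos == self.goal  # observation, terminal flag


class RandomAgent:
    """Baseline with no model of the rules: picks actions uniformly at random."""
    def act(self, observation, actions):
        return random.choice(actions)


env, agent = ToyEnvironment(), RandomAgent()
obs, done, steps = env.pos, False, 0
while not done and steps < 100:  # a step budget stands in for the benchmark's scoring limit
    obs, done = env.step(agent.act(obs, env.actions))
    steps += 1
print(f"solved={done} steps={steps}")
```

The real game environments are far richer than this toy track, but the contract is the same: no instructions, only observations, actions, and feedback, with the agent left to work out what the game even wants.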

This reset matters because it punctures the AGI hype cycle at exactly the right moment. Labs burned millions training specifically on ARC-AGI-2, pushing scores from 3% to around 50% in under a year, only to get knocked back to nearly zero by V3. Chollet designed the benchmark precisely to separate genuine reasoning from expensive pattern matching and brute-force optimization. The $1 million prize backing the challenge has frontier labs paying far more attention than they did to earlier versions.

What's most revealing is the pattern. Every ARC release triggers the same cycle: models get embarrassed, labs throw resources at the problem, scores climb rapidly, then a new version resets everything. Whether the eventual gains on V3 represent actual reasoning breakthroughs or just more sophisticated memorization is exactly what Chollet built the benchmark to expose. For developers betting on model reasoning capabilities, ARC-AGI-3 is the reality check your product roadmap needs.