The interesting thing about the current crop of agentic reasoning benchmarks is how much they disagree about how good models actually are. SWE-bench Verified puts frontier systems north of 80%, which sounds like a solved problem. OSWorld, which measures computer use across applications, has them at 12.24% versus a 72.36% human baseline: basically a different planet. ARC-AGI-1 is saturated at 90%+, while ARC-AGI-3 has the frontier under 1% as of March 2026. τ-bench shows under 50% single-trial success and pass^k consistency under 25%. The scores are not noise; they are measuring different things, and the gap between the most flattering benchmark and the most honest one is now the real story.

The methodology caveat that everyone publishing scores should be required to repeat: agent results are scaffold-dependent. The model is one variable. The prompt design, tool access, retry budget, execution environment, and evaluator version are all the other variables. A SWE-bench Verified score from Anthropic running their own scaffold and a SWE-bench Verified score from a third-party evaluator on the same model can differ by double digits. When a vendor cites 80%, the right next question is "with what scaffold, what tool stack, and what retry policy" — not "great, ship it." The Sierra τ-bench team made the strongest version of this point by introducing pass^k, which measures whether the agent succeeds k times in a row on the same task. The drop from pass@1 to pass^8 is brutal across every model, and that is the reliability gap that production deployments actually hit.
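
For concreteness, here is a minimal sketch of the difference between the two metrics, assuming the standard combinatorial estimators (a Codex-style pass@k and a τ-bench-style "all k trials succeed" pass^k); the per-task trial outcomes below are invented purely for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # pass@k: probability that at least one of k draws (from n trials
    # with c successes) is a success -- the flattering number.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    # pass^k: probability that k independent trials on the SAME task
    # all succeed -- the reliability number production deployments hit.
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Invented trial outcomes: 8 trials per task, 1 = task-level success.
trials = {
    "task_a": [1, 1, 1, 1, 1, 0, 1, 1],
    "task_b": [1, 0, 1, 1, 0, 1, 0, 1],
    "task_c": [0, 1, 0, 0, 1, 0, 0, 0],
}

for k in (1, 4, 8):
    at_k = sum(pass_at_k(len(t), sum(t), k) for t in trials.values()) / len(trials)
    hat_k = sum(pass_hat_k(len(t), sum(t), k) for t in trials.values()) / len(trials)
    print(f"k={k}: pass@{k}={at_k:.2f}   pass^{k}={hat_k:.2f}")
```

At k=1 the two numbers coincide; by k=8, pass^k has collapsed toward the least reliable task, which is exactly the drop described above.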

OSWorld is the benchmark that most cleanly exposes where the gap between demos and deployment lives. A human gets 72% on cross-application GUI tasks. The best frontier model gets 12%. That is not a benchmark in need of harder questions; that is a model class that does not yet know how to operate a computer the way a person does. Most other agentic benchmarks run in text-only or API-only environments where the agent is allowed to call clean tools — OSWorld makes it click buttons, switch windows, deal with whatever the OS throws back. The 60-point gap is the right number to pin to the wall when someone shows you a slick demo of an "AI assistant that uses your computer." Demos are scripted. OSWorld is not.

For builders, the practical reading list looks like this: SWE-bench Verified for code-repair specialization, τ-bench for reliability under repeated trials, OSWorld for computer-use grounding, GAIA for multi-step web reasoning, ARC-AGI-2 for novel visual reasoning, WebArena for web navigation, AgentBench for cross-environment breadth. None of them are sufficient alone. None of them measure cost per task, safety under adversarial input, or multimodal reasoning beyond vision; those are the gaps the field still has not addressed. Pick the two or three that map to your actual product, run your own scaffold against the public eval, and treat vendor headline numbers as marketing until you reproduce them. The scoreboard is more useful as a map of what nobody has solved yet than as a victory lap for what's been claimed.
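
As a closing sketch of what "run your own scaffold against the public eval" can look like in practice: a bare-bones harness that replays each task several times under your own stack and reports both the headline rate and the all-trials-pass rate. `load_tasks`, `run_scaffold`, and the JSONL layout are placeholders, not any benchmark's official API; plug in whatever eval and agent stack you actually use.

```python
import json
import statistics
from pathlib import Path

def load_tasks(path: str) -> list[dict]:
    # Placeholder loader: one JSON object per line, each with an "id" field.
    return [json.loads(line) for line in Path(path).read_text().splitlines() if line.strip()]

def run_scaffold(task: dict) -> bool:
    # Stand-in for your agent plus grader. Prompt design, tool access, retry
    # budget, and evaluator version all live here, which is why someone
    # else's headline number may not transfer to your stack.
    raise NotImplementedError("plug in your own scaffold and grader")

def evaluate(path: str, trials: int = 8) -> None:
    tasks = load_tasks(path)
    results = {t["id"]: [run_scaffold(t) for _ in range(trials)] for t in tasks}
    pass_1 = statistics.mean(statistics.mean(r) for r in results.values())
    pass_all = statistics.mean(all(r) for r in results.values())  # crude pass^trials
    print(f"{len(tasks)} tasks, {trials} trials each: "
          f"pass@1={pass_1:.2f}  pass^{trials}={pass_all:.2f}")
```

The point of owning this loop is less the two summary numbers than the per-task trial log it leaves behind: that is where you see which failures are flaky and which are structural before a customer does.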