MIT Technology Review's AI newsletter framed today's piece around what it called the "underpants gnomes" problem: Step 1 is build the model, Step 3 is enterprise transformation, Step 2 is mostly a hand-wave. The concrete evidence behind the framing is Mercor's APEX-Agents benchmark, which Will Douglas Heaven cited and which deserves its own attention. APEX comprises 480 professional tasks built by experts with 10+ years of experience at top investment banks, management consulting firms, and corporate law practices. The agents work inside 33 simulated "worlds," each a complete Google Workspace environment with Slack threads, Drive files, spreadsheets, and PDFs that the model has to actually navigate; this is not a stripped-down API benchmark. The leaderboard as of last week: GPT-5.5 (xhigh) at 37.7%, GPT-5.4 (xhigh) at 33.3%, Claude Opus 4.6 at 33.0%, Gemini 3.1 Pro Preview at 32.0%. Mercor's own conclusion: no model is ready to replace a professional end-to-end. The MITTR framing is blunter: this is the data point the AI-replaces-work narrative has been allergic to.
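To make the setup concrete, here is a minimal sketch of what "a task embedded in a simulated world" can look like as data. Mercor has not published APEX's internal schema, so the class and field names below are hypothetical illustrations, not the benchmark's actual format.

```python
# Hypothetical sketch only: Mercor's actual APEX schema is not public.
# The point is that the agent gets a pile of workplace artifacts to navigate,
# not a clean API, and the deliverable is graded by rubric rather than exact match.
from dataclasses import dataclass, field


@dataclass
class WorkspaceWorld:
    """One simulated workplace: the documents and chatter the agent must dig through."""
    world_id: str
    slack_threads: list[str] = field(default_factory=list)  # unstructured message dumps
    drive_files: list[str] = field(default_factory=list)    # paths to spreadsheets, PDFs, docs


@dataclass
class ProfessionalTask:
    """One expert-authored deliverable, judged the way a reviewing professional would."""
    task_id: str
    world: WorkspaceWorld
    instruction: str   # e.g. "prepare the diligence summary for Monday's client call"
    rubric: list[str]  # criteria another professional would apply to the output
```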
The technical reality the benchmark surfaces is that frontier models are converging in capability while still failing two out of three real workplace tasks. The 1.3-percentage-point spread separating GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro Preview is striking on its own; we are at the point where lab-to-lab differentiation matters less than the absolute capability ceiling on multi-step professional work. The tasks APEX measures are not the abstracted items of MMLU or even SWE-Bench; they are concrete deliverables a junior banker, lawyer, or consultant would be assigned in their first two years, embedded in the messy real Workspace context where you have to find the right spreadsheet, parse the unstructured Slack thread, cross-reference the PDF, and produce an output that another professional would accept. Models excel at the planning and research substeps, which matches the existing literature, but fail on what Mercor calls strategic judgment calls: the parts of the work where the answer depends on knowing what the firm or the client actually wants, which is not in any document. That is consistent with another study cited in the MITTR piece, in which Anthropic predicted job-disruption likelihoods based on task analysis but acknowledged that task-level analysis does not capture what happens when an agent is dropped into a real workflow with real coworkers and real institutional context.
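One way to see this failure mode in your own evaluations is to break results down by substep category rather than reading a single headline score. The sketch below assumes a hypothetical, hand-labeled results table; it is not how Mercor reports APEX numbers.

```python
# Hypothetical breakdown: group agent eval results by substep category to see
# where failures concentrate. Column names and categories are illustrative only.
import pandas as pd

results = pd.DataFrame([
    # task, substep category, whether a reviewing professional accepted the output
    {"task": "t01", "category": "research", "accepted": True},
    {"task": "t01", "category": "drafting", "accepted": True},
    {"task": "t01", "category": "judgment", "accepted": False},
    {"task": "t02", "category": "research", "accepted": True},
    {"task": "t02", "category": "judgment", "accepted": False},
])

# Per-category acceptance rate: research and drafting rows tend to look fine,
# while the judgment rows are where end-to-end completion collapses.
print(results.groupby("category")["accepted"].mean())
```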
The broader implication is uncomfortable for both the AI bull case and the AI bear case, which is part of why the data is worth taking seriously. Bulls extrapolate from chat benchmarks and demos to "agents will replace knowledge workers in 18 months"; APEX says current frontier agents cannot complete most of the tasks in a junior banker's actual day. Bears extrapolate from current failures to "this whole thing is a bubble"; APEX also shows the score jumping from GPT-5.4's 33.3% to GPT-5.5's 37.7% in a single model iteration, a meaningful capability gain on tasks that resist gaming. The honest reading is the one Mercor publishes alongside the leaderboard: foundation models are getting steadily better at this kind of work, the rate of improvement is real, and the gap to professional-grade end-to-end completion is also real and not closing in the next quarter. The MITTR call for "fewer guesses and more evidence, transparency from model makers, coordination between researchers and businesses, new ways to evaluate this technology" is essentially a request for more APEX-style benchmarks. Right now there are not many; APEX, OSWorld, TAU-Bench, and a handful of others are doing the load-bearing work that ARC, MMLU, and HumanEval did for the previous generation.
For builders shipping agentic products into enterprise, the actionable read is to treat APEX scores as a sanity check rather than a marketing proof point. If a frontier model passes one in three tasks in a Workspace-equivalent environment, your agent in production will look similar unless you have built domain-specific scaffolding (verifiers, retrieval, narrow tool sets) that materially reduces the task surface. The labs that ship agents claiming high enterprise success rates are almost always reporting on a much narrower task distribution than APEX measures, and that difference is exactly the Step 2 gap MITTR is pointing to. Three concrete suggestions: First, when you evaluate agents internally, build your own version of the messy-Workspace setup, not a clean API harness; performance differences of 30-40 percentage points are routine between the two. Second, design your product around the strategic-judgment failure mode: keep humans in the loop on the parts where the answer depends on context the agent cannot see, and automate the research-and-draft substeps where models actually do well. Third, expect the leaderboard to keep climbing; planning your roadmap around a 60-70% APEX score in 18 months is more reasonable than either replacement-in-2026 or never. The real story is in Step 2, and APEX is the closest thing the field has to a measurement of how far along that step we actually are.
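As a sketch of the second suggestion, here is one way to route substeps so that research and drafting run autonomously while judgment calls are escalated to a reviewer. All interfaces here are hypothetical; how you decide which substeps count as judgment calls is the product decision APEX suggests you cannot skip.

```python
# Human-in-the-loop routing sketch (hypothetical interfaces throughout):
# automate the research-and-draft substeps, hold the judgment calls for a person.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Substep:
    description: str
    needs_judgment: bool  # True when the answer depends on context the agent cannot see
                          # (client preferences, firm norms, unstated priorities)


def run_task(substeps: list[Substep],
             run_agent: Callable[[str], str],
             ask_human: Callable[[str], str]) -> list[str]:
    """Execute a task plan, escalating judgment calls instead of letting the agent guess."""
    outputs = []
    for step in substeps:
        if step.needs_judgment:
            outputs.append(ask_human(step.description))  # human decides or approves
        else:
            outputs.append(run_agent(step.description))  # model drafts autonomously
    return outputs


if __name__ == "__main__":
    # run_agent would call your model; ask_human would push to a review queue.
    plan = [
        Substep("Pull last quarter's revenue from the Drive spreadsheet", False),
        Substep("Draft the comparables table", False),
        Substep("Decide which adjustments the client will accept", True),
    ]
    for line in run_task(plan,
                         run_agent=lambda d: f"[agent draft: {d}]",
                         ask_human=lambda d: f"[escalated to reviewer: {d}]"):
        print(line)
```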
