A new benchmark designed to look like real knowledge work has produced a deflating number: the best AI model tested finishes only 3 percent of its tasks fully correctly. The benchmark, AA-Briefcase, comes from the analysis firm Artificial Analysis, and the top performer on it was Anthropic's Claude Fable 5.
What makes the benchmark hard is how lifelike its mess is. Its 91 tasks are built from thousands of fragmented source files, Slack threads, emails, meeting transcripts, and data exports, and they simulate multi-week projects where the relevant information is scattered rather than handed over cleanly. On 31 of the 91 tasks, no model exceeded 50 percent. The scoring is strict by design: a task counts as solved only if every criterion is met, which is closer to how a manager would judge finished work than partial-credit benchmarks are.
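To see why all-or-nothing scoring produces such low numbers, here is a minimal sketch contrasting it with partial credit. It is not the benchmark's actual grader, which the source does not describe. The function names and the four sample tasks are made up for illustration.

```python
from typing import Sequence


def partial_credit(criteria_met: Sequence[bool]) -> float:
    """Fraction of a task's criteria that were satisfied."""
    if not criteria_met:
        raise ValueError("a task needs at least one criterion")
    return sum(criteria_met) / len(criteria_met)


def fully_solved(criteria_met: Sequence[bool]) -> bool:
    """Strict rule: the task counts only if every criterion is met."""
    if not criteria_met:
        raise ValueError("a task needs at least one criterion")
    return all(criteria_met)


# Hypothetical per-criterion results for four tasks (illustrative data only).
tasks = [
    [True, True, True, True],     # everything right
    [True, True, True, False],    # one subtle detail missed
    [True, False, True, True],    # one subtle detail missed
    [False, False, True, False],  # mostly wrong
]

mean_partial = sum(partial_credit(t) for t in tasks) / len(tasks)
full_rate = sum(fully_solved(t) for t in tasks) / len(tasks)

print(f"mean partial credit:  {mean_partial:.2%}")
print(f"full-completion rate: {full_rate:.2%}")
```

In this toy data the same outputs average about 69 percent under partial credit but only 25 percent under the strict rule. A single missed detail costs a task everything, which is the pattern the benchmark is built to expose.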
The failure modes differ by how strong the model is. Weaker models tend to miss relevant files entirely or produce output nobody could use. Stronger models do the obvious part of the job but overlook the subtle, multi-source details that the full task depends on, which is why even the leader lands at 3 percent rather than something comfortable. Cost did not rescue performance either: spending ranged about 800-fold, from roughly 4 cents to more than 31 dollars per task, without a matching jump in results.
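The "about 800-fold" figure can be checked against the two endpoints the article gives. This quick sketch treats them as exactly 4 cents and 31 dollars, an assumption, since the article says "roughly" and "more than".

```python
# Endpoints in cents, taken as exact for illustration (the article gives them as approximate).
low_cents = 4        # "roughly 4 cents"
high_cents = 3100    # "more than 31 dollars"

ratio = high_cents / low_cents
print(f"cost spread: about {ratio:.0f}x")
```

With these rounded endpoints the ratio is 775. Because the upper figure is "more than" 31 dollars, a spread of about 800-fold is consistent with the numbers quoted.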
The point is not that AI is useless at knowledge work, because these same models clearly help with pieces of it every day. The point is the gap between the benchmarks models ace and the real, long-horizon, detail-exacting work they still cannot finish unsupervised. It fits a run of recent results, from a life-sciences benchmark the best model cleared only about a third of the time to surveys of stalled enterprise AI projects, that all point the same way. A 3 percent top score is a healthier signal than another saturated leaderboard, because it measures the part that is actually hard.
