Most AI benchmarks follow a simple formula: give the model a set of questions or tasks with known correct answers, run inference, and compute an accuracy score. MMLU, for example, is essentially a multiple-choice exam spanning 57 subjects from abstract algebra to world religions. HumanEval asks the model to write Python functions that pass unit tests. GSM8K presents grade-school math word problems. The benchmark score is the percentage the model gets right, sometimes weighted, sometimes broken down by category. Under the hood, many benchmarks evaluate models in a zero-shot or few-shot setting — meaning the model gets no examples, or just a handful, before answering. This is supposed to measure genuine capability rather than pattern-matching on a specific format.
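The scoring loop described above can be sketched in a few lines. Everything here is illustrative — the prompt format, the dataset fields, and the `model_answer` callable are stand-ins, not the actual harness used by MMLU or any real benchmark:

```python
def format_item(question, choices):
    """Render one multiple-choice item in an exam-style layout."""
    letters = "ABCD"
    lines = [question]
    lines += [f"{letter}. {choice}" for letter, choice in zip(letters, choices)]
    return "\n".join(lines)

def build_prompt(question, choices, few_shot_examples=()):
    """Prepend optional solved examples; zero-shot when the tuple is empty."""
    parts = []
    for ex in few_shot_examples:
        parts.append(format_item(ex["question"], ex["choices"]) + f"\nAnswer: {ex['answer']}\n")
    parts.append(format_item(question, choices) + "\nAnswer:")
    return "\n".join(parts)

def score(dataset, model_answer, few_shot_examples=()):
    """Accuracy: fraction of items where the model's letter matches the key."""
    correct = 0
    for item in dataset:
        prompt = build_prompt(item["question"], item["choices"], few_shot_examples)
        if model_answer(prompt) == item["answer"]:
            correct += 1
    return correct / len(dataset)
```

Passing an empty `few_shot_examples` gives the zero-shot setting; passing five solved items gives the common 5-shot setup. The point is how little machinery "run inference, compute accuracy" actually involves — the hard part is everything the loop doesn't capture.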
The history of benchmarks in AI is a story of goalposts moving faster than anyone expected. GLUE, released in 2018, was supposed to be a hard test of language understanding. Models surpassed human baselines within a year, so SuperGLUE arrived in 2019. That fell too. MMLU (2020) was designed to last longer, and it did — for a while. By late 2024, frontier models were scoring above 90% on it, and the community had already moved on to harder tests like MMLU-Pro and GPQA (a set of PhD-level science questions where even domain experts struggle). This cycle of create-saturate-replace is one of the defining patterns of modern AI research.
The biggest gotcha with benchmarks is contamination. If the benchmark questions appear in the training data — which is almost inevitable when you train on most of the internet — the model might be recalling answers rather than reasoning. Some teams go further, deliberately or accidentally optimizing for specific benchmarks during training, a practice sometimes called "teaching to the test." This is why you sometimes see a model with impressive MMLU scores produce mediocre results in actual conversation. Projects like Chatbot Arena take a different approach entirely: real users chat with two anonymous models and vote on which response is better. No fixed questions, no known answers — just human preference on real tasks. The resulting rankings correlate surprisingly poorly with traditional benchmarks for some models, which tells you something important about what those benchmarks are actually measuring.
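Turning pairwise votes into a leaderboard is a rating problem. Chatbot Arena's published leaderboard fits a Bradley-Terry model over all votes; the online Elo update below is a simplified stand-in that conveys the same idea, with made-up model names and votes:

```python
def expected(r_a, r_b):
    """Predicted probability that a player rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser, k=32):
    """Shift ratings toward the observed outcome; upsets move them more."""
    surprise = 1.0 - expected(ratings[winner], ratings[loser])
    ratings[winner] += k * surprise
    ratings[loser] -= k * surprise

# Every model starts at the same rating; votes pull them apart.
ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner in ["model_a", "model_a", "model_b"]:
    loser = "model_b" if winner == "model_a" else "model_a"
    update(ratings, winner, loser)
```

Note that no question ever has a "correct" answer here: the rating is defined entirely by which responses humans preferred, which is exactly why it can diverge from exam-style scores.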
There is also a subtler problem: benchmarks measure what is easy to measure, not necessarily what matters. Factual recall and multiple-choice reasoning are straightforward to score automatically. Qualities like helpfulness, nuance, knowing when to say "I don't know," and maintaining coherence over a long conversation are much harder to quantify. This is why serious practitioners look at a basket of benchmarks alongside qualitative testing on their own use cases. A model that scores 2% lower on MMLU but handles your specific domain noticeably better is the better model — for you. The numbers are a starting point for comparison, not a final verdict.
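The "basket of benchmarks" idea can be made concrete with a weighted aggregate, where the weights encode your priorities rather than anyone else's. All numbers below are invented for illustration; the benchmark names are just labels:

```python
# Hypothetical scores for two models on a small basket of evals.
scores = {
    "model_x": {"mmlu": 0.88, "humaneval": 0.70, "domain_eval": 0.92},
    "model_y": {"mmlu": 0.90, "humaneval": 0.75, "domain_eval": 0.80},
}

# Your use case weights the domain-specific eval most heavily.
weights = {"mmlu": 0.2, "humaneval": 0.2, "domain_eval": 0.6}

def weighted_score(bench_scores, weights):
    """Weighted average of benchmark scores under one set of priorities."""
    return sum(weights[name] * value for name, value in bench_scores.items())

best = max(scores, key=lambda model: weighted_score(scores[model], weights))
# model_x wins despite its lower MMLU score, because the domain eval dominates.
```

Change the weights and the winner can flip — which is the whole point: the aggregate answers "better for whom?", not "better in the abstract."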