Key benchmarks: MMLU (Massive Multitask Language Understanding) has 14,042 multiple-choice test questions across 57 subjects, from STEM to the humanities. HumanEval has 164 hand-written Python problems; the model generates a function and is scored by whether it passes unit tests. ARC (AI2 Reasoning Challenge) uses grade-school science exam questions that require reasoning rather than simple lookup. HellaSwag asks the model to pick the most plausible sentence completion, testing commonsense inference. GSM8K has roughly 8,500 grade-school math word problems that need multi-step arithmetic. Each tests a different capability.
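For the multiple-choice benchmarks, the headline number is usually plain accuracy: the share of questions where the model's chosen option matches the answer key, often broken down by subject. The sketch below illustrates that scoring under stated assumptions. The function names, the `(subject, predicted, correct)` record layout, and the sample data are all hypothetical, not taken from any benchmark's official harness.

```python
from collections import defaultdict


def accuracy_by_subject(records):
    """Return {subject: accuracy} for (subject, predicted, correct) records.

    The record layout is an assumption made for this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, predicted, correct in records:
        totals[subject] += 1
        if predicted == correct:
            hits[subject] += 1
    return {subject: hits[subject] / totals[subject] for subject in totals}


def overall_accuracy(records):
    """Return the fraction of all records answered correctly."""
    records = list(records)
    if not records:
        raise ValueError("no records to score")
    correct = sum(1 for _, predicted, answer in records if predicted == answer)
    return correct / len(records)


# Made-up predictions for illustration only.
sample = [
    ("physics", "A", "A"),
    ("physics", "B", "C"),
    ("history", "D", "D"),
    ("history", "A", "A"),
]
per_subject = accuracy_by_subject(sample)
overall = overall_accuracy(sample)
```

Reporting the per-subject breakdown alongside the overall score makes it easier to see when a single aggregate hides a weak area.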
These benchmarks have several known problems: contamination (test questions appear in training data, inflating scores), saturation (once top models all score 95% or higher, the benchmark stops discriminating between them), gaming (training specifically to maximize benchmark scores without genuinely improving capability), and narrow coverage (benchmarks test what is easy to test, not necessarily what matters to users). A model that scores 90% on MMLU might be worse at actually helping a user than one that scores 80% but follows instructions better.
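Contamination can be screened for, roughly, by checking whether long word sequences from a test question also occur in the training corpus. The sketch below is one simple heuristic of that kind, not a standard or complete method: it flags a question if it shares any n-gram with the training text. The function names, the whitespace tokenization, and the default `n=8` are assumptions chosen for illustration, and a real corpus would need an index rather than an in-memory set.

```python
def ngrams(text, n):
    """Return the set of lowercase, whitespace-tokenized n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(test_questions, training_docs, n=8):
    """Return (fraction_flagged, flagged_questions).

    A question is flagged when it shares at least one n-gram with any
    training document. Exact-match overlap misses paraphrases, so this
    gives a lower bound on contamination, not a full audit.
    """
    if not test_questions:
        raise ValueError("no test questions given")
    train_grams = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    flagged = [q for q in test_questions if ngrams(q, n) & train_grams]
    return len(flagged) / len(test_questions), flagged


# Toy data; n=3 only because the example strings are short.
training_docs = ["the quick brown fox jumps over the lazy dog"]
test_questions = [
    "Why does the quick brown fox run?",
    "What is two plus two?",
]
rate, flagged = contamination_rate(test_questions, training_docs, n=3)
```

Here the first question shares the 3-gram "the quick brown" with the training text and is flagged, while the second is not.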
The field is evolving: Chatbot Arena ranks models from crowdsourced human preference votes between pairs of model responses (hard to contaminate, always current). LiveBench uses frequently refreshed questions. SEAL and other private benchmarks keep test data secret. Task-specific evaluations (SWE-bench for resolving real GitHub issues, GPQA for PhD-level science questions) test capabilities that general benchmarks miss. The trend is toward evaluation that looks more like real-world use and less like standardized testing.
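Preference-based leaderboards need some way to turn many individual "response A beat response B" votes into a ranking. An Elo-style rating update is one common way to do that, shown below as a minimal sketch. It is not claimed to be the exact procedure any particular leaderboard uses. The function names, the starting rating of 1000, and the step size `k=32` are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def record_vote(ratings, winner, loser, k=32.0, initial=1000.0):
    """Update ratings in place after one human preference vote.

    Models not seen before start at `initial`. The winner gains exactly
    what the loser gives up, so the total rating is conserved.
    """
    r_winner = ratings.get(winner, initial)
    r_loser = ratings.get(loser, initial)
    delta = k * (1.0 - expected_score(r_winner, r_loser))
    ratings[winner] = r_winner + delta
    ratings[loser] = r_loser - delta
    return ratings


# Hypothetical model names and votes.
ratings = {}
record_vote(ratings, "model_a", "model_b")
first_vote = dict(ratings)
record_vote(ratings, "model_a", "model_b")
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

An upset (a low-rated model beating a high-rated one) moves the ratings more than an expected win, which is why the second identical vote shifts the ratings less than the first.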