A standardized test used to evaluate and compare AI models. Benchmarks measure specific capabilities — reasoning (ARC), math (GSM8K), coding (HumanEval), general knowledge (MMLU) — and produce scores that can be compared across models.
Why it matters
Benchmarks are how the industry keeps score, but they're imperfect. Models can be tuned to ace a benchmark, or absorb its test questions through contaminated training data, without being genuinely better. Real-world performance often tells a different story. Treat scores as signals, not truth.
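The scoring loop behind most benchmarks is simple. A minimal sketch, using exact-match accuracy (the style of metric behind GSM8K and MMLU); `items` and `model_answer` are hypothetical stand-ins, not any real benchmark's API:

```python
# Minimal sketch of benchmark scoring: run each item through the model,
# compare its answer to the reference, report accuracy.
# `items` and `model_answer` are illustrative placeholders.

items = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "5 * 3 = ?", "answer": "15"},
]

def model_answer(question: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris", "5 * 3 = ?": "14"}
    return canned[question]

def score(items, answer_fn) -> float:
    """Exact-match accuracy: fraction of items answered correctly."""
    correct = sum(answer_fn(it["question"]) == it["answer"] for it in items)
    return correct / len(items)

print(f"accuracy: {score(items, model_answer):.2f}")  # two of three correct
```

The fragility the entry warns about lives in this loop: a model that has memorized the reference answers scores perfectly here while learning nothing general.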