A standardized test used to evaluate and compare AI models. Benchmarks measure specific capabilities — reasoning (ARC), math (GSM8K), coding (HumanEval), general knowledge (MMLU) — and produce scores that can be compared across models.
Why it matters
Benchmarks are how the industry keeps score, but they're imperfect. Models can be tuned to ace a benchmark, or absorb its test questions through contaminated training data, without being genuinely better. Real-world performance often tells a different story. Treat scores as signals, not truth.
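The scoring loop behind most benchmarks is simple. A minimal sketch, using exact-match accuracy (the style of metric behind GSM8K and MMLU); `items` and `model_answer` are hypothetical stand-ins, not any real benchmark's API:

```python
# Minimal sketch of benchmark scoring: run each item through the model,
# compare its answer to the reference, report accuracy.
# `items` and `model_answer` are illustrative placeholders.

items = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "5 * 3 = ?", "answer": "15"},
]

def model_answer(question: str) -> str:
    """Placeholder for a call to the model under evaluation."""
    canned = {"2 + 2 = ?": "4", "Capital of France?": "Paris", "5 * 3 = ?": "14"}
    return canned[question]

def score(items, answer_fn) -> float:
    """Exact-match accuracy: fraction of items answered correctly."""
    correct = sum(answer_fn(it["question"]) == it["answer"] for it in items)
    return correct / len(items)

print(f"accuracy: {score(items, model_answer):.2f}")  # two of three correct
```

The fragility the entry warns about lives in this loop: a model that has memorized the reference answers scores perfectly here while learning nothing general.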