
AI Benchmarks

MMLU, HumanEval, ARC, HellaSwag
Standardized tests used to measure and compare the capabilities of AI models. MMLU tests knowledge across 57 academic subjects. HumanEval tests code generation. ARC tests scientific reasoning. HellaSwag tests commonsense reasoning. GSM8K tests math. Benchmark scores give the industry a common language for comparing models, despite significant limitations.
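
At heart, most knowledge benchmarks like MMLU score simple accuracy over multiple-choice items. The sketch below shows that scoring loop in minimal form, assuming a hypothetical item format; the Item class and accuracy helper are illustrative, not the official MMLU evaluation harness.

```python
# Minimal sketch of multiple-choice benchmark scoring (MMLU-style).
# Item format and helper names are illustrative, not an official harness.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]   # e.g. four options labeled A-D
    answer: str          # gold label, e.g. "B"

def accuracy(items: list[Item], model_answers: list[str]) -> float:
    """Fraction of items where the model's chosen letter matches the gold label."""
    correct = sum(pred == item.answer for item, pred in zip(items, model_answers))
    return correct / len(items)

items = [Item("2 + 2 = ?", ["3", "4", "5", "6"], "B")]
print(accuracy(items, ["B"]))  # 1.0
```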

Why It Matters

Benchmarks are how the industry keeps score. When Anthropic says Claude scores X% on MMLU and Y% on HumanEval, those numbers mean something only if you know what the benchmark measures, how it is scored, and what its limitations are. Understanding benchmarks helps you see past marketing claims and judge which model is actually best for your specific use case.

Deep Dive

Key benchmarks: MMLU (Massive Multitask Language Understanding) — 14,042 multiple-choice questions across 57 subjects from STEM to humanities. HumanEval — 164 coding problems testing function generation in Python. ARC (AI2 Reasoning Challenge) — science exam questions requiring reasoning. HellaSwag — sentence completion testing commonsense knowledge. GSM8K — 8,500 grade-school math word problems. Each tests a different capability.
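
Coding benchmarks such as HumanEval report pass@k: the probability that at least one of k sampled completions passes every unit test. The numerically stable unbiased estimator below follows the formula published with HumanEval (Chen et al., 2021); the example numbers are made up.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: n samples generated per problem, c of them passed.

    Computes 1 - C(n - c, k) / C(n, k) in a numerically stable product form.
    """
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples for one problem, 30 passed; estimate pass@1 and pass@10.
print(pass_at_k(200, 30, 1))   # 0.15 (equals c / n)
print(pass_at_k(200, 30, 10))  # ~0.81
```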

Why Benchmarks Are Problematic

Several issues: contamination (test questions appear in training data, inflating scores), saturation (when all models score 95%+, the benchmark stops discriminating), gaming (training specifically to maximize benchmark scores without genuinely improving capability), and narrow coverage (benchmarks test what's testable, not necessarily what matters to users). A model that scores 90% on MMLU might be worse at actually helping a user than one that scores 80% but follows instructions better.
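
One common contamination screen is to look for long verbatim n-gram overlaps between benchmark questions and the training corpus. This is a simplified sketch of that idea; the 8-gram window and function names are illustrative assumptions, not any lab's actual pipeline.

```python
def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """All word-level n-grams in a lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_question: str, training_docs: list[str], n: int = 8) -> bool:
    """Flag a test item if any of its n-grams appears verbatim in a training document."""
    probe = ngrams(test_question, n)
    return any(probe & ngrams(doc, n) for doc in training_docs)

# Toy example: the question's opening 8-gram appears verbatim in the corpus.
corpus = ["the 2008 financial crisis was triggered primarily by the collapse of the housing market"]
question = "The 2008 financial crisis was triggered primarily by the collapse of which market?"
print(is_contaminated(question, corpus))  # True
```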

The Move Beyond Static Benchmarks

The field is evolving: Chatbot Arena uses real-time human preferences (hard to contaminate, always current). LiveBench uses frequently refreshed questions. SEAL and other private benchmarks keep test data secret. Task-specific evaluations (SWE-bench for real GitHub issue solving, GPQA for PhD-level science) test capabilities that general benchmarks miss. The trend is toward evaluation that looks more like real-world use and less like standardized testing.
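
Chatbot Arena builds its leaderboard from pairwise human votes rather than fixed questions. A minimal way to turn such votes into ratings is an online Elo update, sketched below; the K-factor, starting rating, and vote data are invented for illustration, and the published leaderboard now uses a Bradley-Terry style fit rather than this sequential update.

```python
from collections import defaultdict

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(ratings, winner: str, loser: str, k: float = 32.0) -> None:
    """Shift both ratings toward the observed outcome of one head-to-head vote."""
    e = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - e)
    ratings[loser] -= k * (1.0 - e)

ratings = defaultdict(lambda: 1000.0)   # every model starts at 1000
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]
for winner, loser in votes:
    update_elo(ratings, winner, loser)
print(dict(ratings))  # model-a highest, model-c lowest
```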

