
AI Benchmarks

MMLU, HumanEval, ARC, HellaSwag, GSM8K
Standardized tests used to measure and compare the capabilities of AI models. MMLU tests knowledge across 57 academic subjects. HumanEval tests code generation. ARC tests scientific reasoning. HellaSwag tests commonsense reasoning. GSM8K tests math. Benchmark scores give model comparisons a common language, despite significant limitations.

Why It Matters

Benchmarks are how the industry keeps score. When Anthropic says Claude scores X% on MMLU and Y% on HumanEval, those numbers mean something only if you know what the benchmark measures, how it is scored, and where it falls short. Understanding benchmarks helps you cut through marketing claims and judge which model is actually best for your specific use case.

Deep Dive

Key benchmarks: MMLU (Massive Multitask Language Understanding) — 14,042 multiple-choice questions across 57 subjects from STEM to humanities. HumanEval — 164 coding problems testing function generation in Python. ARC (AI2 Reasoning Challenge) — science exam questions requiring reasoning. HellaSwag — sentence completion testing commonsense knowledge. GSM8K — 8,500 grade-school math word problems. Each tests a different capability.
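
To make HumanEval-style scoring concrete: models are typically sampled n times per problem, completions are run against unit tests, and the headline number is pass@k, the estimated probability that at least one of k samples passes. Below is a minimal Python sketch of the unbiased pass@k estimator from the HumanEval paper; the sample counts in the usage lines are made up for illustration.

from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k samples,
    drawn from n generations of which c pass the tests, is correct."""
    if n - c < k:
        return 1.0  # every size-k subset must contain a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative: 200 samples for one problem, 37 pass the unit tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # ~0.88

The reported benchmark score is this value averaged over all 164 problems.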

Why Benchmarks Are Problematic

Several issues: contamination (test questions appear in training data, inflating scores), saturation (when all models score 95%+, the benchmark stops discriminating), gaming (training specifically to maximize benchmark scores without genuinely improving capability), and narrow coverage (benchmarks test what's testable, not necessarily what matters to users). A model that scores 90% on MMLU might be worse at actually helping a user than one that scores 80% but follows instructions better.
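
Contamination is usually screened for with verbatim overlap checks. Here is a minimal sketch, assuming a simple word-level n-gram index; real decontamination pipelines (such as the 13-gram overlap checks reported for GPT-3) handle tokenization, near-duplicates, and corpus scale far more carefully, and the data in the usage lines is hypothetical.

def ngrams(text: str, n: int = 13) -> set:
    # All lowercase word n-grams in a text.
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(question: str, corpus_ngrams: set, n: int = 13) -> bool:
    # Flag a test question whose n-grams appear verbatim in training data.
    return bool(ngrams(question, n) & corpus_ngrams)

# Hypothetical usage: index the training corpus once, then screen questions.
train_docs = ["...training documents..."]
index = set().union(*(ngrams(d) for d in train_docs))
flagged = [q for q in ["...benchmark questions..."] if is_contaminated(q, index)]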

The Move Beyond Static Benchmarks

The field is evolving: Chatbot Arena uses real-time human preferences (hard to contaminate, always current). LiveBench uses frequently refreshed questions. SEAL and other private benchmarks keep test data secret. Task-specific evaluations (SWE-bench for real GitHub issue solving, GPQA for PhD-level science) test capabilities that general benchmarks miss. The trend is toward evaluation that looks more like real-world use and less like standardized testing.
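
Chatbot Arena turns those pairwise human votes into a leaderboard by fitting a rating model: originally Elo-style online updates, and more recently a Bradley-Terry fit, which does not depend on vote order. A minimal Elo sketch in Python, with made-up votes and model names:

from collections import defaultdict

K = 4  # small step size, appropriate when there are many votes

def expected(r_a: float, r_b: float) -> float:
    # Elo expected score of A against B.
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, a, b, winner):
    # Apply one pairwise vote; winner is "a", "b", or "tie".
    e_a = expected(ratings[a], ratings[b])
    s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    ratings[a] += K * (s_a - e_a)
    ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))

ratings = defaultdict(lambda: 1000.0)      # every model starts at 1000
votes = [("model-x", "model-y", "a"),      # hypothetical vote log
         ("model-y", "model-z", "b"),
         ("model-x", "model-z", "tie")]
for a, b, w in votes:
    update(ratings, a, b, w)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))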
