Basics

Contamination

Data Contamination, Benchmark Leakage
When benchmark test data appears in a model's training data, it inflates the model's scores without reflecting real capability. If a model has seen the test questions during training and "learned the answers", its benchmark performance is meaningless. Contamination is a growing problem because training datasets keep getting larger, scraping more of the internet, while benchmark data is often published openly.

Why It Matters

Contamination undermines the entire benchmark system the AI industry uses to compare models. A model that scores 90% on MMLU because it memorized the answers is no smarter than one that scores 80% without ever having seen the questions. As more benchmarks leak into training data, the community is forced to keep creating new ones, and private held-out evaluations become more important than public leaderboards.

Deep Dive

Contamination happens in several ways:

Direct inclusion: benchmark data appears verbatim in the training corpus, often via web scrapes of sites that host benchmark questions.
Indirect leakage: training data includes discussions of benchmark questions, model-generated solutions, or other derivative content.
Temporal leakage: a model is evaluated on a "new" benchmark, but its training data cutoff includes early versions of that benchmark.
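For the direct-inclusion case, a common first pass is checking n-gram overlap between each test item and the training documents. Below is a minimal sketch of that idea; the 13-gram size and 0.8 threshold are illustrative assumptions, not a standard.

```python
import re

def ngrams(text: str, n: int = 13) -> set[tuple[str, ...]]:
    """Lowercase, strip punctuation, and return the set of word n-grams."""
    words = re.sub(r"[^\w\s]", " ", text.lower()).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def looks_contaminated(test_item: str, training_doc: str,
                       n: int = 13, threshold: float = 0.8) -> bool:
    """Flag a test item when most of its n-grams appear in a training doc."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return False  # item shorter than n words: no signal either way
    overlap = len(item_grams & ngrams(training_doc, n)) / len(item_grams)
    return overlap >= threshold
```

Exact-match checks like this only give a lower bound: paraphrases, translations, and reformatted versions of a question slip straight past them.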

Detection Is Hard

Detecting contamination isn't straightforward. You can search for exact matches of test questions in training data, but paraphrased or partial matches are much harder to catch. Some researchers use membership inference attacks: checking whether the model's confidence on test examples is suspiciously higher than on similar unseen examples. But these methods produce both false positives and false negatives, and access to the training data is often limited.
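As a rough illustration of the membership-inference idea, one can compare the model's average log-likelihood on benchmark items against matched items it cannot have seen. The `loglik` callable below is a placeholder for whatever per-token log-probability scoring your model stack provides; the gap statistic is a heuristic signal, not a proof.

```python
from statistics import mean
from typing import Callable

def suspicion_gap(loglik: Callable[[str], float],
                  test_items: list[str],
                  control_items: list[str]) -> float:
    """Mean log-likelihood gap between benchmark items and matched controls.

    A large positive gap means the model is unusually confident on the
    benchmark, which is consistent with (but does not prove) contamination.
    """
    return (mean(loglik(t) for t in test_items)
            - mean(loglik(c) for c in control_items))
```

Choosing good controls is the hard part: they must match the test items in length, topic, and difficulty, or the gap measures distribution shift rather than memorization.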

The Response

The community is responding in several ways: private held-out benchmarks that aren't published (like some internal evaluations at AI labs), dynamic benchmarks that generate new questions regularly, Chatbot Arena (which uses real user preferences rather than static test sets), and contamination analysis as a required part of model evaluation reports. The shift toward human evaluation and live benchmarks is partly driven by the contamination problem.
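To make the dynamic-benchmark idea concrete, here is a toy sketch where items are regenerated from a template each evaluation round, so no fixed answer key exists to leak into a training scrape. The arithmetic template and the seed-per-round scheme are purely illustrative.

```python
import random

def make_item(rng: random.Random) -> tuple[str, int]:
    """One templated question with a freshly sampled answer."""
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return f"What is {a} * {b}?", a * b

def generate_eval_set(seed: int, size: int = 100) -> list[tuple[str, int]]:
    """A fresh eval set per seed, e.g. one new seed per evaluation round."""
    rng = random.Random(seed)
    return [make_item(rng) for _ in range(size)]
```

Real dynamic benchmarks use far richer generators than arithmetic, but the property is the same: items that did not exist before the training cutoff cannot appear in the training data.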
