Fundamentals

Contamination

Data Contamination, Benchmark Leakage

Contamination occurs when benchmark test data appears in a model's training data, inflating its scores so they no longer reflect real capability. If a model saw the test questions during training and effectively "learned the answers," its benchmark performance is meaningless. Contamination is a growing problem: training datasets keep expanding to scrape more of the web, while benchmark data is frequently published openly.

Why It Matters

Contamination undermines the entire benchmark system the AI industry uses to compare models. A model that scores 90% on MMLU because it memorized the answers is not smarter than one that scores 80% without ever having seen them. As more benchmarks leak into training data, the community is forced to keep creating new ones, and private held-out evaluations become more important than public leaderboards.

Deep Dive

Contamination happens in several ways. Direct inclusion: benchmark data appears verbatim in the training corpus (often via web scraping sites that host benchmark questions). Indirect leakage: training data includes discussions about benchmark questions, model-generated solutions, or derivative content. Temporal leakage: a model is evaluated on a "new" benchmark, but the training data cutoff includes early versions of that benchmark.
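Direct inclusion is the easiest case to screen for. A minimal sketch of the common n-gram overlap approach (the function names and the default window size of 8 words are illustrative choices; published decontamination efforts have used windows of roughly 8 to 13 tokens):

```python
from typing import Iterable, Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Word-level n-grams of a text, lowercased."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(test_example: str, training_docs: Iterable[str], n: int = 8) -> float:
    """Fraction of the test example's n-grams that also occur anywhere in the
    training docs. A nonzero ratio flags possible verbatim (direct) inclusion."""
    test_grams = ngrams(test_example, n)
    if not test_grams:
        return 0.0
    train_grams: Set[Tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return len(test_grams & train_grams) / len(test_grams)
```

Exact n-gram matching catches direct inclusion but, as the next section notes, misses paraphrased or derivative leakage entirely.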

Detection Is Hard

Detecting contamination isn't straightforward. You can search for exact matches of test questions in training data, but paraphrased or partial matches are harder to catch. Some researchers use membership inference attacks — checking if the model's confidence on test examples is suspiciously higher than on similar unseen examples. But these methods have false positives and negatives, and access to training data is often limited.
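The membership-inference idea above can be sketched as a simple confidence gap. Here `score_fn` is a hypothetical stand-in for a real model scorer (e.g. average token log-likelihood, higher meaning more confident):

```python
import statistics
from typing import Callable, Iterable

def contamination_signal(score_fn: Callable[[str], float],
                         benchmark_examples: Iterable[str],
                         control_examples: Iterable[str]) -> float:
    """Mean model confidence on benchmark items minus the mean on matched
    unseen control items. A large positive gap is a *signal* of memorization,
    not proof: the method has both false positives and false negatives."""
    bench = [score_fn(x) for x in benchmark_examples]
    ctrl = [score_fn(x) for x in control_examples]
    return statistics.mean(bench) - statistics.mean(ctrl)
```

The hard part in practice is constructing controls that are genuinely comparable to the benchmark items; a gap driven by topic or style differences is a false positive.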

The Response

The community is responding in several ways: private held-out benchmarks that aren't published (like some internal evaluations at AI labs), dynamic benchmarks that generate new questions regularly, Chatbot Arena (which uses real user preferences rather than static test sets), and contamination analysis as a required part of model evaluation reports. The shift toward human evaluation and live benchmarks is partly driven by the contamination problem.
