Contamination: Definition & Meaning — AI Wiki

Contamination occurs when benchmark test data appears in a model's training data, inflating its scores without reflecting genuine capability. If a model "studied the answer key" by seeing test questions during training, its benchmark performance is meaningless. Contamination is a growing problem as training datasets get larger and scrape more of the internet, where benchmark data is often published.

Why It Matters

Contamination undermines the entire benchmark system that the AI industry uses to compare models. A model that scores 90% on MMLU because it memorized the answers is no smarter than one that scores 80% without ever having seen them. As more benchmarks leak into training data, the community is forced to keep creating new ones, and private held-out evaluations become more important than public leaderboards.

Deep Dive

Contamination happens in several ways. Direct inclusion: benchmark data appears verbatim in the training corpus (often because web scrapers pick up sites that host benchmark questions). Indirect leakage: training data includes discussions about benchmark questions, model-generated solutions, or derivative content. Temporal leakage: a model is evaluated on a "new" benchmark, but data collected before the training cutoff already includes early versions of that benchmark.

Detection Is Hard

Detecting contamination isn't straightforward. You can search for exact matches of test questions in training data, but paraphrased or partial matches are harder to catch. Some researchers use membership inference attacks, which check whether the model's confidence on test examples is suspiciously higher than on similar unseen examples. But these methods produce both false positives and false negatives, and access to training data is often limited.
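As an illustration of the exact-match search described above, here is a minimal sketch of a word-level n-gram overlap check. The function names, the n-gram length, and the threshold are all assumptions made for this example, not a standard tool or a recommended setting. As the text notes, a check like this catches verbatim or near-verbatim copies but misses paraphrases.

```python
import re

def ngrams(text, n=8):
    """Return the set of word-level n-grams in text (lowercased, punctuation dropped)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(test_item, corpus_ngrams, n=8):
    """Fraction of the test item's n-grams that also occur in the training corpus."""
    item_ngrams = ngrams(test_item, n)
    if not item_ngrams:
        return 0.0  # item is shorter than n tokens, so nothing can be compared
    return len(item_ngrams & corpus_ngrams) / len(item_ngrams)

def flag_contaminated(test_items, training_docs, n=8, threshold=0.5):
    """Return the test items whose n-gram overlap with the corpus meets the threshold.

    The threshold of 0.5 is an arbitrary illustrative value.
    """
    corpus_ngrams = set()
    for doc in training_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [item for item in test_items
            if overlap_ratio(item, corpus_ngrams, n) >= threshold]

# Hypothetical data: one scraped page that quotes a benchmark question verbatim.
training_docs = [
    "Forum post: What is the capital of France and which river flows through it? Answer below."
]
test_items = [
    "What is the capital of France and which river flows through it?",
    "Which planet in the solar system has the largest number of known moons?",
]
flagged = flag_contaminated(test_items, training_docs, n=5)
```

In this toy example only the first question is flagged, because every one of its 5-grams appears in the scraped page. At real corpus scale the in-memory set would have to be replaced with something like hashed n-grams or an index, which this sketch does not attempt.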

The Response

The community is responding in several ways: private held-out benchmarks that aren't published (like some internal evaluations at AI labs), dynamic benchmarks that generate new questions regularly, Chatbot Arena (which uses real user preferences rather than static test sets), and contamination analysis as a required part of model evaluation reports. The shift toward human evaluation and live benchmarks is partly driven by the contamination problem.
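To make the idea of a dynamic benchmark concrete, here is a toy sketch that regenerates its questions from a seed, so each evaluation round can use items that did not exist when a model's training data was collected. The function name, the arithmetic task, and the seed format are all hypothetical. Real dynamic benchmarks use much richer question sources.

```python
import random

def generate_questions(seed, count=5):
    """Generate a reproducible set of multiplication questions for one evaluation round.

    The same seed always yields the same questions, so a round can be re-scored
    later. Changing the seed (for example, once per month) yields a fresh set.
    """
    rng = random.Random(seed)  # local generator, so global random state is untouched
    questions = []
    for _ in range(count):
        a, b = rng.randint(100, 999), rng.randint(100, 999)
        questions.append({"question": f"What is {a} * {b}?", "answer": a * b})
    return questions

# Hypothetical usage: one question set per evaluation round.
round_questions = generate_questions(seed="2024-06", count=3)
```

Because the questions come from a seed rather than a published file, there is no fixed test set for a scraper to pick up. The template itself can still leak, however, so this reduces contamination by memorization without ruling out a model having learned the task format.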
