
Model Collapse

Data Feedback Loop
The degradation that occurs when AI models are trained on data generated by previous AI models, creating a feedback loop in which errors and biases accumulate across generations. Each generation loses some of the diversity and amplifies some of the artifacts of the one before it, eventually producing models that generate repetitive, generic, or distorted outputs.

Why It Matters

Model collapse is the time bomb of the AI-generated-content era. As the internet fills with AI-generated text (estimates range from 10–50% of new web content), future models trained on web scrapes will inevitably ingest AI outputs. If this isn't managed carefully, model quality could plateau or degrade. This is why data curation and provenance tracking are becoming critical infrastructure.

Deep Dive

The mechanism: a model trained on real data captures the distribution imperfectly — it overestimates some patterns and misses others. When a second model trains on the first model's outputs, it captures the first model's imperfect distribution, amplifying the errors. By generation 5 or 10, the distribution has collapsed to a narrow, distorted version of the original. Shumailov et al. (2023) demonstrated this empirically across multiple model types.
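The compounding-error mechanism can be illustrated with a deliberately simple toy (not the paper's experimental setup): repeatedly fit a Gaussian to samples drawn from the previous generation's fitted Gaussian. Because each fit is made from a finite sample, estimation error compounds, and the fitted variance tends to drift toward zero — the distribution narrows, generation after generation. All parameters here (sample size, generation count) are illustrative choices.

```python
import random
import statistics

def collapse_demo(generations=300, n=50, seed=0):
    """Toy model-collapse loop: each 'generation' fits a Gaussian to
    n samples drawn from the previous generation's fitted Gaussian.
    Finite-sample estimation error compounds across generations."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0                     # the original "real data" distribution
    stds = [sigma]
    for _ in range(generations):
        samples = [rng.gauss(mu, sigma) for _ in range(n)]
        mu = statistics.fmean(samples)       # next generation's mean estimate
        sigma = statistics.pstdev(samples)   # next generation's std estimate
        stds.append(sigma)
    return stds

history = collapse_demo()
print(f"gen 0 std: {history[0]:.3f}, gen 300 std: {history[-1]:.3f}")
```

The fitted standard deviation shrinks over generations because the variance estimate is biased low and the noise in each estimate is inherited by the next generation — a minimal analogue of the distributional narrowing Shumailov et al. observed in full-scale models.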

The Internet Contamination Problem

The practical concern: pre-training datasets are typically scraped from the web, and the web increasingly contains AI-generated content. If 20% of a training corpus is AI-generated, and that AI content has the same statistical biases as the model being trained, those biases get reinforced. The result isn't catastrophic failure but gradual homogenization — models that sound more and more like each other and less like the diversity of human expression.
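The gradual-homogenization dynamic can be sketched with a hypothetical token distribution (the numbers below are invented for illustration): each generation trains on a corpus that is 80% human text and 20% of the previous model's output, and the model's output over-concentrates on frequent tokens. The entropy of the training mix drifts downward even though human data still dominates.

```python
import math

HUMAN = [0.40, 0.30, 0.15, 0.10, 0.05]   # hypothetical token frequencies

def sharpen(dist, gamma=1.5):
    """Stand-in for a model's bias toward frequent tokens
    (like sampling at temperature < 1)."""
    w = [p ** gamma for p in dist]
    total = sum(w)
    return [x / total for x in w]

def entropy_bits(dist):
    return -sum(p * math.log2(p) for p in dist if p > 0)

def contaminate(generations=10, ai_fraction=0.2):
    """Each generation's training mix blends the fixed human distribution
    with the previous model's sharpened output distribution."""
    train = HUMAN[:]
    ents = [entropy_bits(train)]
    for _ in range(generations):
        model_out = sharpen(train)
        train = [(1 - ai_fraction) * h + ai_fraction * m
                 for h, m in zip(HUMAN, model_out)]
        ents.append(entropy_bits(train))
    return ents

ents = contaminate()
print(f"entropy: {ents[0]:.3f} bits -> {ents[-1]:.3f} bits")
```

The entropy settles at a fixed point below the human baseline rather than collapsing to zero — matching the point above that contamination produces homogenization, not catastrophic failure.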

Mitigations

Solutions include: detecting and filtering AI-generated content from training data (hard at scale), mixing AI-generated data with verified human data (maintaining a "human data floor"), watermarking AI outputs to enable filtering, and maintaining curated, AI-free reference datasets. Some researchers argue that model collapse is overstated if data is properly diversified and quality-controlled, but the risk is taken seriously enough that major labs invest in data provenance.
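The "human data floor" idea can be sketched as a simple curation step. This is a hypothetical helper, not any lab's actual pipeline; in practice, classifying documents as human or AI-generated (via provenance labels or watermark detection) is the hard part that this sketch assumes is already done.

```python
def enforce_human_floor(human_docs, ai_docs, human_floor=0.7):
    """Cap the AI-generated share of a training corpus so that
    verified human documents make up at least `human_floor` of it.
    Assumes documents are already labeled by provenance."""
    # Max AI docs allowed given the human count and the floor ratio.
    max_ai = round(len(human_docs) * (1 - human_floor) / human_floor)
    return human_docs + ai_docs[:max_ai]

# 70 human docs with a 70% floor admit at most 30 AI docs,
# even though 100 AI docs are available.
corpus = enforce_human_floor(["h"] * 70, ["a"] * 100, human_floor=0.7)
print(len(corpus), corpus.count("a"))  # 100 30
```

A design note: capping the AI share relative to the verified-human count (rather than filtering AI content outright) lets a pipeline still benefit from high-quality synthetic data while bounding the feedback-loop risk.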
