Training

Model Collapse

Data Feedback Loop
The degradation that occurs when AI models are trained on data generated by earlier AI models, creating a feedback loop in which errors and biases accumulate across generations. Each generation loses some of the previous generation's diversity and amplifies some of its artifacts, eventually producing models that generate repetitive, generic, or distorted outputs.

Why It Matters

Model collapse is the time bomb of the AI-generated-content era. As the internet fills with AI-generated text (estimated at 10–50% of new web content), future models trained on web scrapes will inevitably ingest AI outputs. If this is not carefully managed, model quality could plateau or degrade. That is why data curation and provenance tracking are becoming critical infrastructure.

Deep Dive

The mechanism: a model trained on real data captures the distribution imperfectly — it overestimates some patterns and misses others. When a second model trains on the first model's outputs, it captures the first model's imperfect distribution, amplifying the errors. By generation 5 or 10, the distribution has collapsed to a narrow, distorted version of the original. Shumailov et al. (2023) demonstrated this empirically across multiple model types.
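The compounding-error dynamic can be seen in a minimal toy sketch (an illustration, not the paper's actual experiment): repeatedly fit a Gaussian to a finite sample, then draw the next generation's "training data" from the fit. Because each fit is imperfect, the estimation error compounds, and the fitted spread tends to shrink toward collapse:

```python
import numpy as np

def generational_fit(n_samples=20, n_generations=500, seed=0):
    """Fit a Gaussian to samples, resample from the fit, repeat.

    Each generation estimates (mean, std) from a finite sample drawn
    from the PREVIOUS generation's fitted model, so estimation error
    compounds across generations instead of averaging out.
    """
    rng = np.random.default_rng(seed)
    mu, sigma = 0.0, 1.0  # generation 0: the "real data" distribution
    fitted_stds = []
    for _ in range(n_generations):
        samples = rng.normal(mu, sigma, size=n_samples)
        mu, sigma = samples.mean(), samples.std()  # maximum-likelihood fit
        fitted_stds.append(sigma)
    return fitted_stds

stds = generational_fit()
print(f"std after 1 generation:    {stds[0]:.3f}")
print(f"std after 500 generations: {stds[-1]:.3f}")
```

With small samples, the fitted standard deviation drifts downward generation after generation: the distribution narrows, mirroring the loss of diversity described above.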

The Internet Contamination Problem

The practical concern: pre-training datasets are typically scraped from the web, and the web increasingly contains AI-generated content. If 20% of a training corpus is AI-generated, and that AI content has the same statistical biases as the model being trained, those biases get reinforced. The result isn't catastrophic failure but gradual homogenization — models that sound more and more like each other and less like the diversity of human expression.
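A hypothetical sketch of that reinforcement dynamic (the `sharpen` helper and its `power` parameter are illustrative assumptions, not a real model's behavior): treat the corpus as a token distribution, model a generator's bias toward high-probability outputs as a sharpening step, and mix 20% model output back into the training data each round. Mass drifts toward the head of the distribution, but it never fully collapses, because the human data anchors the mix:

```python
import numpy as np

def sharpen(dist, power=1.2):
    """Toy stand-in for a model's bias toward its most probable outputs."""
    q = dist ** power
    return q / q.sum()

def contaminated_training(human, contamination=0.2, steps=50):
    """Each round, the new corpus is (1 - c) human data + c model output."""
    dist = human.copy()
    for _ in range(steps):
        model_output = sharpen(dist)  # the model over-weights the head
        dist = (1 - contamination) * human + contamination * model_output
    return dist

human = np.array([0.4, 0.3, 0.2, 0.1])  # token distribution of human text
final = contaminated_training(human)
print("human:", human)
print("after contaminated training:", np.round(final, 3))
```

The fixed point is more peaked than the human distribution but still close to it: gradual homogenization rather than catastrophic failure, exactly as described above.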

Mitigations

Solutions include: detecting and filtering AI-generated content from training data (hard at scale), mixing AI-generated data with verified human data (maintaining a "human data floor"), watermarking AI outputs to enable filtering, and maintaining curated, AI-free reference datasets. Some researchers argue that model collapse is overstated if data is properly diversified and quality-controlled, but the risk is taken seriously enough that major labs invest in data provenance.
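The "human data floor" idea can be sketched in a few lines (the helper and the floor value are illustrative, not any lab's actual pipeline): cap the amount of synthetic data admitted so that verified human documents never fall below a chosen fraction of the corpus:

```python
def build_corpus(human_docs, synthetic_docs, human_floor=0.5):
    """Cap synthetic documents so verified human text stays at or above
    `human_floor` as a fraction of the final training corpus."""
    max_synthetic = int(len(human_docs) * (1 - human_floor) / human_floor)
    return human_docs + synthetic_docs[:max_synthetic]

human = [f"human_{i}" for i in range(10)]
synthetic = [f"synthetic_{i}" for i in range(50)]
corpus = build_corpus(human, synthetic)
print(len(corpus))  # 20: ten human docs plus at most ten synthetic
```

Note that this mitigation depends on the filtering and provenance steps listed above actually working: the cap is only meaningful if the "human" pool is genuinely AI-free.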
