Training

Model Collapse

Data Feedback Loop
The degradation that occurs when AI models are trained on data generated by previous AI models, creating a feedback loop in which errors and biases accumulate across generations. Each generation loses some diversity and amplifies some artifacts of the previous generation, eventually producing models that generate repetitive, generic, or distorted outputs.

Why It Matters

Model collapse is the ticking time bomb of the AI-generated content era. As the internet fills with AI-generated text (an estimated 10–50% of new web content), future models trained on web scrapes will inevitably ingest AI outputs. If this is not managed carefully, model quality could plateau or degrade. This is why data curation and provenance tracking are becoming critical infrastructure.

Deep Dive

The mechanism: a model trained on real data captures the distribution imperfectly — it overestimates some patterns and misses others. When a second model trains on the first model's outputs, it captures the first model's imperfect distribution, amplifying the errors. By generation 5 or 10, the distribution has collapsed to a narrow, distorted version of the original. Shumailov et al. (2023) demonstrated this empirically across multiple model types.
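This mechanism can be sketched in a toy simulation. The snippet below is illustrative, not from Shumailov et al.: each "model" is just a categorical distribution fitted to the previous generation's samples. The vocabulary size, sample size, and function names (`fit`, `generate`) are arbitrary assumptions. The key property is that once a token disappears from a generation's output, no later generation can ever produce it again, so the distribution's support can only shrink.

```python
import random
from collections import Counter

def fit(samples):
    # "Training": estimate a categorical distribution from the observed samples
    counts = Counter(samples)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def generate(dist, n, rng):
    # "Generation": sample n tokens from the fitted distribution
    toks = list(dist)
    return rng.choices(toks, weights=[dist[t] for t in toks], k=n)

rng = random.Random(42)
vocab = list(range(100))                 # a toy 100-token "vocabulary"
data = rng.choices(vocab, k=200)         # generation 0: samples of real data
print("gen 0 distinct tokens:", len(set(data)))

for gen in range(1, 31):                 # each model trains on its predecessor's output
    dist = fit(data)
    data = generate(dist, 200, rng)

print("gen 30 distinct tokens:", len(set(data)))
```

Running this, the count of distinct surviving tokens falls generation over generation: finite-sample estimation loses rare tokens, and recursive training makes those losses permanent.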

The Internet Contamination Problem

The practical concern: pre-training datasets are typically scraped from the web, and the web increasingly contains AI-generated content. If 20% of a training corpus is AI-generated, and that AI content has the same statistical biases as the model being trained, those biases get reinforced. The result isn't catastrophic failure but gradual homogenization — models that sound more and more like each other and less like the diversity of human expression.
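A variation on the same toy simulation (again illustrative, with arbitrary parameters) shows why partial contamination is gradual rather than catastrophic: here 20% of each generation's corpus is recycled model output and 80% is fresh "human" data, and the fresh data keeps reintroducing tokens that pure self-training would lose.

```python
import random
from collections import Counter

def fit(samples):
    # Estimate a categorical distribution from the observed samples
    counts = Counter(samples)
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

def sample(dist, n, rng):
    toks = list(dist)
    return rng.choices(toks, weights=[dist[t] for t in toks], k=n)

rng = random.Random(0)
vocab = list(range(100))
data = rng.choices(vocab, k=200)         # generation 0: all human data

for _ in range(30):
    dist = fit(data)
    ai_part = sample(dist, 40, rng)      # 20% AI-generated (recycled from the model)
    human_part = rng.choices(vocab, k=160)  # 80% fresh human data
    data = ai_part + human_part

print("distinct tokens after 30 generations:", len(set(data)))
```

Unlike the pure feedback loop, the vocabulary here stays broad; the AI fraction tilts the statistics toward whatever the model already favors, which is the homogenization effect rather than outright collapse.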

Mitigations

Solutions include: detecting and filtering AI-generated content from training data (hard at scale), mixing AI-generated data with verified human data (maintaining a "human data floor"), watermarking AI outputs to enable filtering, and maintaining curated, AI-free reference datasets. Some researchers argue that model collapse is overstated if data is properly diversified and quality-controlled, but the risk is taken seriously enough that major labs invest in data provenance.
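The "human data floor" idea can be made concrete with an idealized, noise-free model (a sketch with assumed numbers, not any lab's actual pipeline): if each new training corpus is a fixed mix of a fraction p of model output and (1 - p) of human data, and the next model's distribution simply matches its corpus, the update q ← (1 - p)·h + p·q contracts toward the human distribution h at rate p per generation.

```python
# Idealized mixing dynamics under a human data floor (ignores estimation noise).
h = [0.25, 0.25, 0.25, 0.25]   # hypothetical "true" human token distribution
q = [0.70, 0.10, 0.10, 0.10]   # a skewed model distribution (assumed for illustration)
p = 0.5                        # assumed fraction of AI-generated data per generation

for _ in range(20):
    # The next model's distribution matches its mixed training corpus
    q = [(1 - p) * h_i + p * q_i for h_i, q_i in zip(h, q)]

gap = max(abs(q_i - h_i) for q_i, h_i in zip(q, h))
print(gap)   # the gap to h shrinks by a factor of p each generation
```

In this idealization the floor fully restores the human distribution; in practice, estimation noise and imperfect filtering mean it bounds drift rather than eliminating it.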

Related Concepts

← All Terms
← Model Card · Model Merging →