Training

Synthetic Data

AI-Generated Training Data
Training data generated by AI models rather than collected from real-world sources. A large model generates examples that are then used to train or fine-tune other models. This can include synthetic question-answer pairs, synthetic conversations, synthetic code, or augmented versions of real data. It is becoming a standard part of the training pipeline at most AI companies.
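The generate-then-train loop described above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the `generate` function is a hypothetical stand-in for a frontier-model API call, and the topics and output format are invented for the example.

```python
import json

# Hypothetical stand-in for a frontier-model call; a real pipeline
# would query a large model's API here.
def generate(prompt: str) -> str:
    return f"Q: What is {prompt}?\nA: {prompt} is a placeholder answer."

TOPICS = ["backpropagation", "tokenization", "quantization"]

def synthesize_pairs(topics):
    """Turn each topic into one synthetic question-answer record."""
    records = []
    for topic in topics:
        raw = generate(topic)
        q_line, a_line = raw.split("\n")
        # Strip the "Q: " / "A: " prefixes to get clean fields.
        records.append({"question": q_line[3:], "answer": a_line[3:]})
    return records

# Serialize as JSONL, a common on-disk format for training sets.
jsonl = "\n".join(json.dumps(r) for r in synthesize_pairs(TOPICS))
```

The resulting JSONL records would then feed the training or fine-tuning run of a smaller model.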

Why it matters

Synthetic data is reshaping AI development because labeled real-world data is expensive, slow to collect, and sometimes impossible to obtain (rare medical cases, rare events, privacy-sensitive domains). When a frontier model can generate 10 million training examples overnight, the economics of data collection change fundamentally. But quality control is critical: training on bad synthetic data amplifies errors.

Deep Dive

The uses of synthetic data span the entire training pipeline. For pre-training, synthetic data can fill gaps in underrepresented domains or languages. For fine-tuning, frontier models generate instruction-following examples that teach smaller models specific skills. For alignment, models generate responses that are then ranked by humans or other models. For evaluation, synthetic benchmarks test capabilities that natural benchmarks don't cover.
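For the fine-tuning case, a common shape for the resulting records is the chat-message format. A minimal sketch follows; `TEACHER_OUTPUTS` is an invented placeholder for instruction/response pairs produced by a frontier model, and the format shown is one widely used convention rather than a universal standard.

```python
# Placeholder for text produced by a frontier "teacher" model.
TEACHER_OUTPUTS = [
    ("Summarize: the cat sat on the mat.", "A cat sat on a mat."),
    ("Translate to French: hello", "bonjour"),
]

def to_chat_example(instruction: str, response: str) -> dict:
    """One fine-tuning record: a user turn plus the target assistant turn."""
    return {
        "messages": [
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ]
    }

dataset = [to_chat_example(i, r) for i, r in TEACHER_OUTPUTS]
```

A supervised fine-tuning run would train the smaller model to reproduce the assistant turn given the user turn, which is how the teacher's skills are distilled.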

Model Collapse

A key risk: if you train models on too much synthetic data from previous models, errors accumulate across generations. This is called "model collapse" — each generation loses some diversity and amplifies some biases from the previous one. The result is models that produce increasingly generic, repetitive, or distorted outputs. The research consensus is that synthetic data works best when mixed with real data and when quality is carefully filtered.
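The two mitigations named above, quality filtering and mixing with real data, can be sketched as follows. The length bounds, duplicate check, and mixing ratio are illustrative assumptions; real pipelines use far more sophisticated filters (model-based scoring, n-gram decontamination).

```python
def filter_synthetic(examples, min_len=20, max_len=2000):
    """Keep examples whose text is within length bounds; drop duplicates."""
    seen, kept = set(), []
    for ex in examples:
        text = ex["text"].strip()
        if not (min_len <= len(text) <= max_len):
            continue  # too short (likely degenerate) or too long
        if text.lower() in seen:
            continue  # exact duplicate: a repeated synthetic output
        seen.add(text.lower())
        kept.append(ex)
    return kept

def mix_datasets(real, synthetic, synthetic_ratio=0.3):
    """Combine real data with enough synthetic data to hit the target ratio."""
    n_syn = int(len(real) * synthetic_ratio / (1 - synthetic_ratio))
    return real + synthetic[:n_syn]
```

Capping the synthetic fraction rather than training on synthetic data alone is what keeps each generation anchored to the real-data distribution.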

The Legality Question

Using synthetic data raises legal and ethical questions. If Model A generates training data and Model B is trained on it, does Model B inherit any IP issues from Model A's training data? Most model providers' terms of service address this — some allow it (Llama's license permits it), some restrict it (OpenAI's terms historically prohibited training competing models on their outputs). The legal landscape is still evolving, but synthetic data is now so pervasive that the industry largely treats it as a standard practice with provider-specific restrictions.
