Training

Synthetic Data

AI-Generated Training Data
Training data generated by AI models rather than collected from real-world sources. A large model generates examples that are then used to train or fine-tune other models. This can include synthetic question-answer pairs, synthetic conversations, synthetic code, or augmented versions of real data. It is becoming a standard part of the training pipeline at most AI companies.
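A minimal sketch of what producing synthetic question-answer records might look like. The `teacher_model` function below is a hypothetical stand-in for a call to a large model's API, and the JSONL record shape is one common convention, not a fixed standard:

```python
import json

def teacher_model(prompt: str) -> str:
    """Hypothetical stand-in for a frontier-model API call."""
    # In practice this would query a large model; here we return a canned answer.
    return f"A concise answer to: {prompt}"

def make_synthetic_pair(question: str) -> dict:
    # One training record in a common instruction-tuning JSONL shape.
    return {"prompt": question, "completion": teacher_model(question)}

seed_questions = [
    "What is gradient descent?",
    "Explain overfitting in one sentence.",
]
records = [make_synthetic_pair(q) for q in seed_questions]
jsonl = "\n".join(json.dumps(r) for r in records)  # ready to write to a .jsonl file
```

In a real pipeline the seed questions themselves are often model-generated too, then deduplicated and filtered before the teacher answers them.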

Why It Matters

Synthetic data is reshaping AI development because labeled real-world data is expensive, slow to collect, and sometimes impossible to obtain (rare medical cases, rare events, privacy-sensitive domains). When a frontier model can generate 10 million training examples overnight, the economics of data collection change fundamentally. But quality control is critical: training on bad synthetic data amplifies errors.

Deep Dive

The uses of synthetic data span the entire training pipeline. For pre-training, synthetic data can fill gaps in underrepresented domains or languages. For fine-tuning, frontier models generate instruction-following examples that teach smaller models specific skills. For alignment, models generate responses that are then ranked by humans or other models. For evaluation, synthetic benchmarks test capabilities that natural benchmarks don't cover.
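The "generate responses, then rank them" alignment step described above can be sketched as best-of-n selection. In this sketch, `generate` and `score` are hypothetical stand-ins for a model call and a judge (human or model) ranking, assumed only for illustration:

```python
import random
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 4) -> str:
    """Generate n candidate responses and keep the highest-scoring one."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: score(prompt, c))

# Hypothetical stand-ins: `generate` would call a model, `score` a judge model.
random.seed(0)
generate = lambda p: f"answer v{random.randint(1, 100)}"
score = lambda p, c: len(c)  # toy judge: prefer longer answers

best = best_of_n("Explain dropout.", generate, score)
```

The surviving high-ranked responses (or the full ranked sets, as preference pairs) then become the synthetic training data for the next model.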

Model Collapse

A key risk: if you train models on too much synthetic data from previous models, errors accumulate across generations. This is called "model collapse" — each generation loses some diversity and amplifies some biases from the previous one. The result is models that produce increasingly generic, repetitive, or distorted outputs. The research consensus is that synthetic data works best when mixed with real data and when quality is carefully filtered.
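The diversity loss can be illustrated with a toy simulation: each "generation" learns only the empirical distribution of the previous generation's samples, so the set of distinct outputs can only shrink. This is a minimal sketch of the mechanism, not a model of real training dynamics:

```python
import random

def next_generation(corpus: list, n: int = 200) -> list:
    # "Train" on the corpus = learn its empirical distribution;
    # "generate" = sample with replacement from it.
    return [random.choice(corpus) for _ in range(n)]

random.seed(42)
corpus = list(range(200))  # real data: 200 distinct "phrasings"
diversity = [len(set(corpus))]
for gen in range(30):
    corpus = next_generation(corpus)  # trained only on the previous model's output
    diversity.append(len(set(corpus)))
# diversity is non-increasing: items never sampled in one generation
# are gone forever, so later generations grow more repetitive.
```

Mixing fresh real data back into `corpus` each round, as the research consensus suggests, keeps the support from shrinking this way.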

The Legality Question

Using synthetic data raises legal and ethical questions. If Model A generates training data and Model B is trained on it, does Model B inherit any IP issues from Model A's training data? Most model providers' terms of service address this: some allow it (Llama's license permits it), while others restrict it (OpenAI's terms historically prohibited training competing models on their outputs). The legal landscape is still evolving, but synthetic data is now so pervasive that the industry largely treats it as standard practice subject to provider-specific restrictions.
