Training

Synthetic Data

AI-Generated Training Data
Training data generated by AI models rather than collected from real-world sources. A large model generates examples, which are then used to train or fine-tune other models. This can include synthetic question-answer pairs, synthetic conversations, synthetic code, or augmented versions of real data. It is becoming a standard part of the training pipeline at most AI companies.

Why It Matters

Synthetic data is reshaping AI development because real-world labeled data is expensive, slow to collect, and sometimes impossible to obtain (rare medical cases, rare events, privacy-sensitive domains). When a frontier model can generate 10 million training examples overnight, the economics of data collection change fundamentally. But quality control is critical: training on bad synthetic data amplifies its errors.

Deep Dive

The uses of synthetic data span the entire training pipeline. For pre-training, synthetic data can fill gaps in underrepresented domains or languages. For fine-tuning, frontier models generate instruction-following examples that teach smaller models specific skills. For alignment, models generate responses that are then ranked by humans or other models. For evaluation, synthetic benchmarks test capabilities that natural benchmarks don't cover.
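The fine-tuning use case above can be sketched as a self-instruct-style loop: prompt a model with a few seed tasks, have it propose a new task, then have it answer its own task. This is a minimal illustration, not any specific library's API; `stub_model` is a hypothetical stand-in for a real frontier-model API call.

```python
import random

def generate_synthetic_pairs(seed_tasks, model, n_samples, k_shots=2):
    """Self-instruct-style loop: few-shot prompt a model with seed tasks
    and collect new instruction-response pairs for fine-tuning."""
    pairs = []
    for _ in range(n_samples):
        shots = random.sample(seed_tasks, k_shots)
        prompt = "\n".join(f"Task: {t}" for t in shots) + "\nTask:"
        instruction = model(prompt)  # model proposes a new task
        response = model(f"Task: {instruction}\nAnswer:")  # model answers it
        pairs.append({"instruction": instruction, "response": response})
    return pairs

# Hypothetical stub standing in for a real LLM API call.
def stub_model(prompt):
    if prompt.endswith("Task:"):
        return "Summarize the following text."
    return "A short summary."

seeds = ["Translate English to French.", "Write a haiku about rain.", "Explain recursion."]
data = generate_synthetic_pairs(seeds, stub_model, n_samples=3)
```

In a real pipeline the two `model` calls would hit a frontier model, and the resulting pairs would go through quality filtering before being used to fine-tune a smaller model.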

Model Collapse

A key risk: if you train models on too much synthetic data from previous models, errors accumulate across generations. This is called "model collapse" — each generation loses some diversity and amplifies some biases from the previous one. The result is models that produce increasingly generic, repetitive, or distorted outputs. The research consensus is that synthetic data works best when mixed with real data and when quality is carefully filtered.
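The two mitigations named above (filtering quality and mixing in real data) can be sketched concretely. The filters and the 30% synthetic cap below are illustrative assumptions, crude stand-ins for production-grade quality classifiers, perplexity cutoffs, and empirically tuned mixing ratios.

```python
import random

def filter_synthetic(examples, min_len=20, max_len=2000):
    """Drop exact duplicates and out-of-range texts: a crude stand-in
    for real quality filters (classifiers, perplexity cutoffs, dedup)."""
    seen, kept = set(), []
    for ex in examples:
        text = ex["text"].strip()
        if not (min_len <= len(text) <= max_len):
            continue
        if text in seen:
            continue
        seen.add(text)
        kept.append(ex)
    return kept

def mix_corpora(real, synthetic, synthetic_fraction=0.3, seed=0):
    """Cap the synthetic share of the training mix to limit collapse risk."""
    rng = random.Random(seed)
    # Solve n_syn / (n_real + n_syn) = synthetic_fraction for n_syn.
    n_syn = int(len(real) * synthetic_fraction / (1 - synthetic_fraction))
    mixed = real + rng.sample(synthetic, min(n_syn, len(synthetic)))
    rng.shuffle(mixed)
    return mixed
```

Each new generation of synthetic data would pass through `filter_synthetic` before `mix_corpora` bounds its proportion, so real data always anchors the distribution.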

The Legality Question

Using synthetic data raises legal and ethical questions. If Model A generates training data and Model B is trained on it, does Model B inherit any IP issues from Model A's training data? Most model providers' terms of service address this: some allow it (Llama's license permits it), some restrict it (OpenAI's terms historically prohibited training competing models on their outputs). The legal landscape is still evolving, but synthetic data is now so pervasive that the industry largely treats it as a standard practice with provider-specific restrictions.
