
Synthetic Data

AI-Generated Training Data
Training data generated by AI models rather than collected from real-world sources. A large model generates examples, which are then used to train or fine-tune other models. This can include synthetic question-answer pairs, synthetic dialogues, synthetic code, or augmented versions of real data. It is becoming a standard part of the training pipeline at most AI companies.
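The generate-then-train loop can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `call_teacher_model` is a hypothetical stand-in for a real frontier-model API call, and the output format is one common instruction/response convention.

```python
import json

def call_teacher_model(prompt: str) -> str:
    # Hypothetical placeholder: a real pipeline would call a frontier
    # model's API here and return its generated text.
    return f"Synthetic answer to: {prompt}"

def make_synthetic_pair(topic: str) -> dict:
    # One synthetic Q&A pair: a templated question plus a teacher answer.
    question = f"Explain {topic} in one paragraph."
    answer = call_teacher_model(question)
    return {"instruction": question, "response": answer}

def build_dataset(topics, n_per_topic=2):
    # Expand a topic list into a fine-tuning dataset of instruction pairs.
    dataset = []
    for topic in topics:
        for _ in range(n_per_topic):
            dataset.append(make_synthetic_pair(topic))
    return dataset

if __name__ == "__main__":
    data = build_dataset(["gradient descent", "attention"])
    print(json.dumps(data[0]))
```

In a real system the teacher call would be batched and the templated question would itself be varied (often by another model) to avoid repetitive data.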

Why It Matters

Synthetic data is reshaping AI development because real-world labeled data is expensive, slow to collect, and sometimes impossible to obtain (rare medical cases, rare events, privacy-sensitive domains). When a frontier model can generate 10 million training examples overnight, the economics of data collection change fundamentally. But quality control is critical: training on bad synthetic data amplifies errors.

Deep Dive

The uses of synthetic data span the entire training pipeline. For pre-training, synthetic data can fill gaps in underrepresented domains or languages. For fine-tuning, frontier models generate instruction-following examples that teach smaller models specific skills. For alignment, models generate responses that are then ranked by humans or other models. For evaluation, synthetic benchmarks test capabilities that natural benchmarks don't cover.
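The alignment use case above can be sketched as a preference-pair builder: generate candidate responses, rank them with a judge, and keep a (chosen, rejected) pair. Here `judge_score` is a hypothetical stand-in for a human rater or an LLM judge, reduced to a toy keyword heuristic for illustration.

```python
def judge_score(prompt: str, response: str) -> float:
    # Toy heuristic standing in for a real judge: prefer responses that
    # mention the prompt's final key term.
    key_term = prompt.split()[-1].strip("?.")
    return 1.0 if key_term.lower() in response.lower() else 0.0

def build_preference_pair(prompt: str, candidates: list) -> dict:
    # Rank candidate responses and emit a (chosen, rejected) pair in the
    # shape commonly used for preference tuning.
    ranked = sorted(candidates, key=lambda r: judge_score(prompt, r),
                    reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}
```

Swapping the toy heuristic for a model-based judge turns this into the "responses ranked by other models" pattern described above.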

Model Collapse

A key risk: if you train models on too much synthetic data from previous models, errors accumulate across generations. This is called "model collapse" — each generation loses some diversity and amplifies some biases from the previous one. The result is models that produce increasingly generic, repetitive, or distorted outputs. The research consensus is that synthetic data works best when mixed with real data and when quality is carefully filtered.
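The mitigation described above (mix with real data, filter for quality) can be sketched as a data-mixing step. This is a toy illustration under stated assumptions: `quality_score` is a hypothetical stand-in for a real filter such as a classifier or LLM judge, and the 30% synthetic cap is an arbitrary example value, not a recommended ratio.

```python
import random

def quality_score(example: str) -> float:
    # Hypothetical stand-in for a real quality filter; here, a toy
    # length-based heuristic.
    return min(len(example.split()) / 20.0, 1.0)

def mix_training_data(real, synthetic, max_synthetic_frac=0.3,
                      min_quality=0.5, seed=0):
    # Drop low-quality synthetic examples first.
    filtered = [ex for ex in synthetic if quality_score(ex) >= min_quality]
    # Cap the synthetic share so real data still anchors the distribution.
    cap = int(len(real) * max_synthetic_frac / (1 - max_synthetic_frac))
    rng = random.Random(seed)
    rng.shuffle(filtered)
    mixed = real + filtered[:cap]
    rng.shuffle(mixed)
    return mixed
```

Keeping the synthetic fraction bounded and filtered is the practical version of the consensus stated above; repeated unfiltered self-training is exactly the regime where collapse appears.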

The Legality Question

Using synthetic data raises legal and ethical questions. If Model A generates training data and Model B is trained on it, does Model B inherit any IP issues from Model A's training data? Most model providers' terms of service address this: some allow it (Llama's community license permits it), while some restrict it (OpenAI's terms have historically prohibited using outputs to train competing models). The legal landscape is still evolving, but synthetic data is now so pervasive that the industry largely treats it as a standard practice subject to provider-specific restrictions.
