Zubnet AI Learning Wiki › Dataset
Fundamentals

Dataset

Training Set, Data
A structured collection of data used to train, evaluate, or test machine learning models. A dataset can be labeled (each example has a known correct answer) or unlabeled (raw data without annotations). A dataset's quality, size, diversity, and representativeness fundamentally determine what a model can learn.
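The labeled/unlabeled distinction can be made concrete with a toy sketch. The field names and example texts below are hypothetical, chosen only to illustrate the idea:

```python
# Hypothetical toy data: a labeled dataset pairs each example with a
# known answer; an unlabeled dataset is raw examples only.
labeled = [
    {"text": "Great product, works perfectly.", "label": "positive"},
    {"text": "Broke after one day.", "label": "negative"},
]

unlabeled = [
    {"text": "Arrived on Tuesday in a plain box."},
    {"text": "The manual is 40 pages long."},
]

def is_labeled(dataset):
    """A dataset counts as labeled if every example carries an answer."""
    return all("label" in example for example in dataset)

print(is_labeled(labeled))    # True
print(is_labeled(unlabeled))  # False
```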

Why It Matters

Garbage in, garbage out. The most elegant architecture, trained on a bad dataset, still produces poor results. Conversely, a simple model trained on excellent data often beats a complex model trained on noise. Dataset curation is arguably the most impactful, and least glamorous, part of AI development.

Deep Dive

Datasets come in many forms: text corpora for language models, labeled images for classifiers, question-answer pairs for fine-tuning, preference pairs for alignment, and benchmark datasets for evaluation. The distinction between training set (what the model learns from), validation set (what guides hyperparameter tuning), and test set (what measures final performance) is fundamental — evaluating on training data is meaningless because the model has memorized it.
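The three-way split described above can be sketched in a few lines. The fractions and seed below are illustrative defaults, not a standard; real pipelines may also split by time or by group to avoid leakage:

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve off disjoint validation and test sets.

    The training set is what the model learns from, the validation set
    guides hyperparameter tuning, and the test set is held out for the
    final performance measurement.
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = examples[:]             # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(1000)))
print(len(train), len(val), len(test))  # 800 100 100
```

Because the model has seen every training example, accuracy measured on `train` only tells you about memorization; `test` must stay untouched until the very end.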

The Data Scaling Story

LLM pre-training datasets have grown from millions of tokens (early GPT) to trillions (modern models). Common Crawl, Wikipedia, books, code repositories, scientific papers, and curated web text form the typical mix. But more data isn't automatically better: the Chinchilla scaling laws showed that, for compute-optimal training, the number of training tokens must grow in proportion to model size, and quality matters as much as raw quantity. Deduplication, filtering out toxic or low-quality content, and balancing domains are all critical steps.
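Of the cleaning steps above, exact deduplication is the simplest to illustrate. The sketch below hashes lightly normalized text to drop exact duplicates; production pipelines also do near-duplicate detection (e.g. MinHash), which this deliberately omits:

```python
import hashlib

def dedupe_exact(documents):
    """Drop exact duplicate documents by hashing normalized text.

    Normalization here (strip + lowercase) is a minimal illustrative
    choice; real pipelines normalize more aggressively and also handle
    near-duplicates, which a hash of the full text cannot catch.
    """
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)   # keep the first occurrence verbatim
    return unique

docs = ["The cat sat.", "the cat sat.  ", "A dog ran."]
print(dedupe_exact(docs))  # ['The cat sat.', 'A dog ran.']
```

Hashing keeps memory proportional to the number of unique documents rather than their total size, which is why hash-based dedup scales to web-crawl corpora.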

Bias Lives in the Data

Every dataset carries the biases of its sources. A model trained mostly on English web text will perform worse on other languages. A dataset scraped from the internet inherits society's prejudices. This isn't a problem you can fix with architecture — it requires careful data curation, auditing, and post-training mitigation. The most impactful AI ethics work often happens at the dataset level.
