Basics

Dataset

Training Set, Data
A structured collection of data used to train, evaluate, or test machine learning models. Datasets may be labeled (each example has a known correct answer) or unlabeled (raw data without annotations). A dataset's quality, size, diversity, and representativeness fundamentally determine what a model can learn.

Why It Matters

Garbage in, garbage out. The most elegant architecture trained on a bad dataset will still produce poor results. Conversely, a simple model trained on excellent data often beats a complex model trained on noise. Dataset curation is arguably the most impactful, and least glamorous, part of AI development.

Deep Dive

Datasets come in many forms: text corpora for language models, labeled images for classifiers, question-answer pairs for fine-tuning, preference pairs for alignment, and benchmark datasets for evaluation. The distinction between training set (what the model learns from), validation set (what guides hyperparameter tuning), and test set (what measures final performance) is fundamental — evaluating on training data is meaningless because the model has memorized it.
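The three-way split described above can be sketched in a few lines. This is a minimal illustration (the function name and fractions are ours, not a standard API); in practice libraries like scikit-learn provide equivalent utilities.

```python
import random

def train_val_test_split(samples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle a dataset once, then partition it into disjoint
    train / validation / test splits."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # fixed seed for reproducibility
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

train, val, test = train_val_test_split(range(1000))
print(len(train), len(val), len(test))  # 800 100 100
```

The key property is disjointness: the model only ever sees `train`, hyperparameters are tuned against `val`, and `test` is touched exactly once for the final number.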

The Data Scaling Story

LLM pre-training datasets have grown from millions of tokens (early GPT) to trillions (modern models). Common Crawl, Wikipedia, books, code repositories, scientific papers, and curated web text form the typical mix. But more data isn't always better — the Chinchilla scaling laws showed that training data should grow in proportion to model size (roughly 20 tokens per parameter), and that many earlier large models were undertrained. Deduplication, filtering toxic or low-quality content, and balancing domains are all critical curation steps.
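One of the curation steps above, exact deduplication, can be sketched as a hash set over normalized text. This is a simplified illustration (the function name is ours); production pipelines typically add fuzzy matching such as MinHash to catch near-duplicates, which this sketch does not handle.

```python
import hashlib

def dedupe_exact(docs):
    """Drop exact duplicates by hashing whitespace-normalized,
    lowercased text. Keeps the first occurrence of each document."""
    seen = set()
    unique = []
    for doc in docs:
        normalized = " ".join(doc.split()).lower()
        key = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

docs = ["Hello world", "hello  world", "Another doc"]
print(dedupe_exact(docs))  # ['Hello world', 'Another doc']
```

Even this trivial normalization (collapsing whitespace, lowercasing) matters: web-scraped corpora contain enormous numbers of byte-identical and near-identical pages, and leaving them in wastes compute and encourages memorization.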

Bias Lives in the Data

Every dataset carries the biases of its sources. A model trained mostly on English web text will perform worse on other languages. A dataset scraped from the internet inherits society's prejudices. This isn't a problem you can fix with architecture — it requires careful data curation, auditing, and post-training mitigation. The most impactful AI ethics work often happens at the dataset level.
