Zubnet AI सीखें Wiki › Dataset
Fundamentals

Dataset

Training Set, Data
A structured collection of data used to train, evaluate, or test a machine learning model. Datasets may be labeled (each example has a known correct answer) or unlabeled (raw data without annotations). A dataset's quality, size, diversity, and representativeness fundamentally determine what a model can learn.
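The labeled/unlabeled distinction can be made concrete with a minimal sketch; the sentiment examples and the `is_labeled` helper below are hypothetical illustrations, not a real dataset format.

```python
# Labeled: each example pairs an input with a known correct answer.
labeled = [
    {"text": "This movie was fantastic!", "label": "positive"},
    {"text": "Terrible plot, wooden acting.", "label": "negative"},
]

# Unlabeled: raw inputs with no annotations.
unlabeled = [
    {"text": "The film opens in a quiet coastal town."},
]

def is_labeled(dataset):
    """Return True if every example carries a 'label' annotation."""
    return all("label" in ex for ex in dataset)

print(is_labeled(labeled))    # True
print(is_labeled(unlabeled))  # False
```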

Why It Matters

Garbage in, garbage out. Even the most elegant architecture will produce bad results if trained on a bad dataset. Conversely, a simple model trained on excellent data often outperforms a complex model trained on noise. Dataset curation is arguably the most impactful and least glamorous part of AI development.

Deep Dive

Datasets come in many forms: text corpora for language models, labeled images for classifiers, question-answer pairs for fine-tuning, preference pairs for alignment, and benchmark datasets for evaluation. The distinction between training set (what the model learns from), validation set (what guides hyperparameter tuning), and test set (what measures final performance) is fundamental — evaluating on training data is meaningless because the model has memorized it.
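The train/validation/test split described above can be sketched as follows. This is a minimal illustration assuming a simple random split with hypothetical fractions; real pipelines often need stratified or grouped splits to avoid leakage.

```python
import random

def split_dataset(examples, val_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle once, then carve out disjoint train/validation/test sets.

    train: what the model learns from
    val:   what guides hyperparameter tuning
    test:  what measures final performance (never trained on)
    """
    rng = random.Random(seed)          # fixed seed for reproducibility
    examples = examples[:]             # copy so the caller's list is untouched
    rng.shuffle(examples)
    n = len(examples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test

data = list(range(100))
train, val, test = split_dataset(data)
print(len(train), len(val), len(test))  # 80 10 10
```

Because the three sets are disjoint, evaluating on `test` measures generalization rather than memorization.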

The Data Scaling Story

LLM pre-training datasets have grown from millions of tokens (early GPT) to trillions (modern models). Common Crawl, Wikipedia, books, code repositories, scientific papers, and curated web text form the typical mix. But more data isn't always better — the Chinchilla scaling laws showed that data quality and quantity must scale together with model size. Deduplication, filtering toxic or low-quality content, and balancing domains are all critical steps.
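Of the cleaning steps listed above, exact deduplication is the simplest to sketch. The hash-based pass below is an assumed minimal approach; production pipelines also use fuzzy near-duplicate detection (e.g. MinHash) and quality filters, which are omitted here.

```python
import hashlib

def deduplicate(documents):
    """Drop exact duplicates by hashing lightly normalized text."""
    seen = set()
    unique = []
    for doc in documents:
        # Normalize trivially (whitespace, case) before hashing so
        # near-identical copies collapse to the same digest.
        digest = hashlib.sha256(doc.strip().lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)   # keep the first occurrence verbatim
    return unique

docs = ["Hello world", "hello world  ", "Something else"]
print(deduplicate(docs))  # ['Hello world', 'Something else']
```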

Bias Lives in the Data

Every dataset carries the biases of its sources. A model trained mostly on English web text will perform worse on other languages. A dataset scraped from the internet inherits society's prejudices. This isn't a problem you can fix with architecture — it requires careful data curation, auditing, and post-training mitigation. The most impactful AI ethics work often happens at the dataset level.

Related Concepts
