
Validation Set

Also known as: Dev Set, Hold-Out Set
A subset of data held out from training, used during development to evaluate model performance and tune hyperparameters. In the three-way split, the training set trains the model, the validation set guides decisions about the model (learning rate, architecture, when to stop), and the test set provides the final unbiased performance estimate. The validation set is your mirror during development.

Why It Matters

Without a validation set you are flying blind. Training loss tells you how well the model fits the training data, but not how well it generalizes. The validation set answers the question that actually matters: "How will this model perform on data it has never seen?" Every decision during model development (hyperparameters, architecture choices, training duration) should be evaluated on the validation set.

Deep Dive

Typical splits: 80% training, 10% validation, 10% test. For large datasets, smaller percentages for validation and test suffice (even 1% of a million examples is 10,000 — plenty for reliable evaluation). For small datasets, cross-validation is preferred (see: Cross-Validation). The key rule: never use the test set for any decision during development. It's only for the final evaluation. If you peek at the test set during development, your performance estimate becomes biased.
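The 80/10/10 split above can be sketched as a small helper; `three_way_split` and its default fractions are hypothetical names for illustration, not from a specific library:

```python
import random

def three_way_split(examples, val_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle indices once, then carve out test, validation, and train.

    Hypothetical helper: the test slice is taken first so it is fixed
    and never touched again until the final evaluation.
    """
    rng = random.Random(seed)
    idx = list(range(len(examples)))
    rng.shuffle(idx)
    n_test = int(len(examples) * test_frac)
    n_val = int(len(examples) * val_frac)
    test = [examples[i] for i in idx[:n_test]]
    val = [examples[i] for i in idx[n_test:n_test + n_val]]
    train = [examples[i] for i in idx[n_test + n_val:]]
    return train, val, test

data = list(range(1000))
train, val, test = three_way_split(data)
print(len(train), len(val), len(test))  # 800 100 100
```

Fixing the random seed makes the split reproducible, so every experiment during development is evaluated against the same validation examples.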

Stratification

When splitting data, ensure each split has a representative distribution of classes, domains, and other important characteristics. If your dataset is 90% English and 10% French, a random split might put all French examples in the training set, leaving you unable to evaluate French performance. Stratified splitting ensures proportional representation in each split. For time-series data, use temporal splits (train on past, validate on future) rather than random splits.
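A minimal sketch of stratified splitting for the 90% English / 10% French case above, grouping by label and splitting each group proportionally (`stratified_split` is a hypothetical helper; real projects often use `sklearn.model_selection.train_test_split` with its `stratify` parameter):

```python
import random
from collections import defaultdict

def stratified_split(examples, labels, val_frac=0.1, seed=0):
    """Split so each label keeps its proportion in train and validation."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for ex, lab in zip(examples, labels):
        by_label[lab].append(ex)
    train, val = [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_val = max(1, int(len(group) * val_frac))  # at least one per class
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val

# 900 English and 100 French examples: the validation set
# gets 90 English and 10 French, preserving the 90/10 ratio.
labels = ["en"] * 900 + ["fr"] * 100
examples = list(range(1000))
train, val = stratified_split(examples, labels, val_frac=0.1)
```

A plain random split could, by chance, leave French nearly absent from validation; stratifying guarantees the proportions by construction.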

Validation in LLM Development

For LLM pre-training, the validation set is a held-out portion of the training corpus, used to compute perplexity during training. For fine-tuning, it's a held-out portion of the fine-tuning dataset. For alignment (RLHF/DPO), validation is more complex: automated metrics (reward model scores) plus human evaluation on held-out prompts. The validation strategy should match how the model will actually be used — if users will ask diverse questions, the validation set should be diverse.
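The perplexity computed on a held-out validation portion can be sketched as follows, assuming you have the model's probability for each actual next token in the stream (the function name and inputs are illustrative):

```python
import math

def validation_perplexity(token_probs):
    """Perplexity over a held-out token stream: exp(mean negative log-likelihood).

    token_probs: probability the model assigned to each actual next token.
    """
    nll = -sum(math.log(p) for p in token_probs)
    return math.exp(nll / len(token_probs))

probs = [0.25, 0.5, 0.125, 0.25]
print(validation_perplexity(probs))  # 4.0
```

Lower validation perplexity means the model is, on average, less "surprised" by unseen text; tracking it during pre-training is the standard way to detect when further training stops helping generalization.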
