Training

Batch Size & Epoch

Mini-Batch, Training Epoch
Batch size is the number of training examples the model processes before each parameter update. An epoch is one complete pass through the training dataset. A model trained for 3 epochs on 1 million examples with a batch size of 1,000 processes 1,000 examples per update, makes 1,000 updates per epoch, and makes 3,000 updates in total.
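
To make the bookkeeping concrete, here is the same arithmetic as a few lines of Python (the numbers are simply those from the example above):

```python
num_examples = 1_000_000   # training set size from the example
batch_size = 1_000         # examples processed per parameter update
num_epochs = 3

updates_per_epoch = num_examples // batch_size   # 1,000 updates
total_updates = updates_per_epoch * num_epochs   # 3,000 updates
print(updates_per_epoch, total_updates)          # -> 1000 3000
```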

Why It Matters

Batch size and epoch count are the most fundamental training controls. Batch size affects training speed, memory usage, and even what the model learns: small batches inject noise that can help generalization, while large batches converge faster but may generalize worse. The number of epochs determines how many times the model sees each example; too few and the model underfits, too many and it overfits.

Deep Dive

In practice, stochastic gradient descent processes the training data in random mini-batches. Each batch gives an estimate of the true gradient — larger batches give better estimates (less noise) but cost more memory and compute per step. Typical batch sizes range from 32 (small models, single GPU) to millions of tokens (LLM pre-training across thousands of GPUs).
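
To make this structure concrete, here is a minimal mini-batch SGD sketch in NumPy. The task (plain linear regression), the data, and every hyperparameter are illustrative choices, not anything prescribed by this article:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1024, 8))                 # 1,024 examples, 8 features
true_w = rng.normal(size=8)
y = X @ true_w + 0.1 * rng.normal(size=1024)   # noisy linear targets

w = np.zeros(8)
batch_size, num_epochs, lr = 32, 10, 0.1

for epoch in range(num_epochs):                # one epoch = one full pass
    perm = rng.permutation(len(X))             # reshuffle every epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        # Mini-batch gradient of mean squared error: a noisy estimate
        # of the full-dataset gradient.
        grad = 2.0 * xb.T @ (xb @ w - yb) / len(xb)
        w -= lr * grad                         # one parameter update per batch
```

With 1,024 examples and a batch size of 32, each epoch makes 32 updates, so this loop performs 320 updates in total.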

The Large-Batch Training Challenge

LLM pre-training uses enormous effective batch sizes (millions of tokens per update) distributed across many GPUs. At this scale, the learning rate must be carefully tuned — the linear scaling rule (double the batch size, double the learning rate) works up to a point, then breaks down. Gradient accumulation lets you simulate large batches on smaller hardware by accumulating gradients across multiple forward passes before updating.
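
Here is one way the accumulation pattern can look in PyTorch; the tiny model, optimizer settings, and random micro-batches are placeholders, and only the update-every-N-steps structure is the point:

```python
import torch
from torch import nn

model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# 32 micro-batches of 4 examples each (placeholder random data).
micro_batches = [(torch.randn(4, 16), torch.randn(4, 1)) for _ in range(32)]
accum_steps = 8                       # effective batch = 4 * 8 = 32 examples

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y)
    (loss / accum_steps).backward()   # scale so accumulated grads average
    if (step + 1) % accum_steps == 0:
        optimizer.step()              # one optimizer update per 8 passes
        optimizer.zero_grad()
```

Dividing the loss by accum_steps before backward() makes the summed gradients equal the average over the effective batch, matching what a single large batch would compute (up to batch-dependent layers like batch norm).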

Epochs in the LLM Era

Modern LLM pre-training typically runs for less than one epoch on the full dataset — the data is so large that the model never sees all of it. This is a shift from classical ML where 10–100 epochs was normal. Research suggests that repeating data (multiple epochs) can actually hurt LLM performance due to memorization effects, though this depends on data quality. Fine-tuning, by contrast, typically runs for 1–5 epochs on a much smaller dataset.
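
A back-of-envelope calculation shows why "less than one epoch" is typical; the token counts below are made-up round numbers, not figures from any particular model:

```python
dataset_tokens = 10e12    # hypothetical 10-trillion-token corpus
training_tokens = 2e12    # hypothetical training budget for the run

epochs_seen = training_tokens / dataset_tokens
print(f"{epochs_seen:.1f} epochs")   # -> 0.2 epochs: most tokens never seen
```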
