Overfitting happens when a model has enough capacity to memorize the specific patterns in its training data — including noise, outliers, and incidental correlations — rather than learning the underlying generalizable patterns. Mechanically, you detect it by tracking two loss curves during training: the training loss and the validation loss (computed on a held-out set the model never trains on). In a healthy training run, both curves go down together. Overfitting shows up as a divergence: training loss keeps decreasing while validation loss plateaus or starts climbing. That gap is the model spending its capacity on memorization rather than generalization.
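The divergence described above is easy to check mechanically once you log both loss curves. A minimal sketch (the function name, the window size, and the loss histories are illustrative, not from any particular framework):

```python
def overfitting_gap(train_losses, val_losses, window=3):
    """Average validation loss minus average training loss over the last
    `window` epochs. A growing positive gap while training loss is still
    falling is the classic overfitting signature."""
    recent_train = train_losses[-window:]
    recent_val = val_losses[-window:]
    return sum(recent_val) / len(recent_val) - sum(recent_train) / len(recent_train)

# Hypothetical loss curves: training keeps improving, validation turns upward.
train = [2.1, 1.6, 1.2, 0.9, 0.7, 0.5]
val = [2.2, 1.7, 1.4, 1.3, 1.4, 1.6]

print(round(overfitting_gap(train, val), 3))
```

In practice you would compute this from your training framework's logged metrics rather than hand-built lists; the point is that the signal is just the recent gap between the two curves.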
The classic defenses against overfitting have been refined over decades and most still apply to modern LLM training. Dropout randomly zeroes out a fraction of neuron activations during training, forcing the model to build redundant representations rather than relying on any single pathway. Weight decay penalizes large weight values (it is equivalent to L2 regularization under plain SGD, though the two differ under adaptive optimizers like Adam), discouraging the model from fitting narrow, high-magnitude patterns. Early stopping means monitoring validation loss and halting training when it stops improving, even if training loss is still falling. Data augmentation — creating synthetic variations of your training data — effectively expands the dataset without collecting new data. For language models, this might mean paraphrasing, back-translation, or context windowing strategies that present the same text with different surrounding context.
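Dropout and weight decay are usually single flags in a deep learning framework, but early stopping is simple enough to sketch framework-independently. A minimal version (the class name, patience value, and threshold are illustrative choices, not a standard API):

```python
class EarlyStopping:
    """Halt training when validation loss stops improving for `patience`
    consecutive epochs, even if training loss is still falling."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # how many bad epochs to tolerate
        self.min_delta = min_delta    # minimum improvement that counts
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

A typical loop would call `stopper.step(val_loss)` once per epoch and break when it returns True, optionally restoring the checkpoint from the best epoch.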
In the large language model era, overfitting has some non-obvious characteristics. Very large models trained on very large datasets are often in the "underfitting" regime for pre-training — they could benefit from more data or more training steps, not fewer. The Chinchilla scaling laws formalized this: for a given compute budget, there is an optimal balance between model size and training tokens, and most early LLMs were undertrained, with too few tokens relative to their parameter count. Overfitting during pre-training at frontier scale is rare precisely because the datasets are so enormous. But it becomes a serious concern during fine-tuning, where datasets are typically orders of magnitude smaller. Fine-tuning a 7B model on a few thousand examples for more than 2-3 epochs almost always overfits, and the symptoms are recognizable: the model starts echoing training examples verbatim, loses the ability to handle prompts that differ from the training format, and may even degrade on general tasks it previously handled well.
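A commonly cited rule of thumb derived from the Chinchilla results is roughly 20 training tokens per parameter at the compute-optimal point. As a back-of-envelope sketch (the exact ratio depends on the compute budget and data quality; 20 is just the popular approximation):

```python
def chinchilla_optimal_tokens(n_params, tokens_per_param=20):
    """Rough compute-optimal training token count using the commonly
    cited ~20 tokens-per-parameter approximation from Chinchilla."""
    return n_params * tokens_per_param

# A 7e9-parameter model would want on the order of 1.4e11 tokens.
print(f"{chinchilla_optimal_tokens(7e9):.2e}")
```

Models trained well past this ratio are not necessarily overfitting — many recent small models are deliberately trained far beyond it to improve inference-time quality per parameter — but the ratio is a useful sanity check on whether a pre-training run is data-starved.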
One of the most insidious forms of overfitting in modern AI is benchmark overfitting, where training data happens to contain (or is deliberately selected to contain) questions similar to evaluation benchmarks. The model scores well on the benchmark but has not actually acquired the underlying capability. This is different from classical overfitting because the model generalizes fine to data similar to its training set — the problem is that the benchmark is measuring training-set-adjacent performance rather than true capability. This is why the field has moved toward held-out evaluation sets, contamination detection, and human-preference-based evaluation like Chatbot Arena, where the test questions are not known in advance and cannot be gamed through data selection.
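Contamination detection is often approximated with n-gram overlap between training documents and benchmark items. A toy sketch of the idea (real pipelines use longer n-grams, normalization, and scale to billions of documents; the function names and the 3-gram test below are illustrative):

```python
def ngrams(text, n=8):
    """Set of word-level n-grams in a lowercased text."""
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def contamination_overlap(train_doc, benchmark_item, n=8):
    """Fraction of the benchmark item's n-grams that also appear in the
    training document. High overlap suggests the benchmark leaked into
    the training data."""
    bench = ngrams(benchmark_item, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(train_doc, n)) / len(bench)
```

An overlap near 1.0 flags an item for removal from either the training set or the evaluation; an overlap near 0.0 is consistent with a clean split. This is a heuristic, not proof — paraphrased contamination evades exact n-gram matching entirely.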
For practitioners, the most useful mental model is that overfitting is not a binary state but a spectrum. Some degree of memorization is inevitable and even desirable — you want the model to know that Paris is the capital of France, which is a memorized fact. The problem arises when memorization crowds out generalization: the model recalls the exact phrasing from training instead of understanding the concept well enough to answer novel questions about it. Watching the training-validation loss gap, using parameter-efficient methods like LoRA (which limit the model's capacity to overfit), and testing on genuinely out-of-distribution examples are the best practical defenses.
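The reason LoRA limits overfitting capacity can be seen in its structure: the frozen weight matrix W is adjusted only by a low-rank product, so only a small number of parameters can move during fine-tuning. A toy pure-Python sketch of that idea (the class name and the tiny matrices are illustrative; real implementations live in libraries such as Hugging Face's peft):

```python
def matmul(A, B):
    """Naive matrix multiply for small illustrative matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

class LoRALinear:
    """Frozen weight W plus a low-rank trainable update:
    W_eff = W + (alpha / r) * B @ A, where A is (r x in_dim) and
    B is (out_dim x r). Only A and B train, so the fine-tune can
    move far fewer parameters than full fine-tuning would."""

    def __init__(self, W, A, B, alpha=1.0):
        self.W, self.A, self.B = W, A, B
        self.r = len(A)          # rank of the update
        self.alpha = alpha       # scaling hyperparameter

    def effective_weight(self):
        scale = self.alpha / self.r
        delta = matmul(self.B, self.A)
        return [[w + scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.W, delta)]
```

For a d x d weight matrix, full fine-tuning can change d² parameters while a rank-r LoRA update changes only 2dr, which is the sense in which the method caps memorization capacity.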