
Loss Function

Also known as: Objective Function, Cost Function
A mathematical function that measures how wrong a model's predictions are. The model's entire goal during training is to minimize this number. For LLMs, the loss function is typically cross-entropy loss — it measures how surprised the model is by the actual next token compared to its predicted probability distribution. Lower loss means the model's predictions match reality more closely.

Why it matters

The loss function is the compass of training. Everything a model learns is in service of reducing this one number. Choosing the wrong loss function means the model optimizes for the wrong thing. Understanding loss helps you interpret training curves, diagnose problems (loss plateau? divergence? overfitting?), and understand why models behave the way they do.

Deep Dive

Cross-entropy loss for language models works like this: at each position in the text, the model predicts a probability distribution over its entire vocabulary. The loss at that position is the negative log probability assigned to the actual next token. If the model predicted the correct token with 90% probability, loss is low (−log(0.9) ≈ 0.1). If it predicted the correct token with 1% probability, loss is high (−log(0.01) ≈ 4.6). Averaging across all positions gives the mean loss that training reports and minimizes.
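The per-position calculation above can be sketched in a few lines. This is a toy illustration, not a real model: the predicted distributions and token names are invented for the example, and real implementations work over logits with vectorized ops.

```python
import math

def cross_entropy(predictions, targets):
    """Mean negative log probability assigned to each actual next token.

    predictions: list of dicts mapping token -> predicted probability,
                 one dict per position (a stand-in for the model's output).
    targets:     the actual next token at each position.
    """
    total = 0.0
    for dist, target in zip(predictions, targets):
        total += -math.log(dist[target])  # surprise at the true token
    return total / len(targets)

# Two positions: confident and correct at the first (90%),
# badly surprised at the second (1%).
preds = [{"cat": 0.9, "dog": 0.1}, {"sat": 0.01, "ran": 0.99}]
actual = ["cat", "sat"]
loss = cross_entropy(preds, actual)  # (0.105 + 4.605) / 2 ≈ 2.36
```

Note how a single badly mispredicted token dominates the average: that is exactly the "surprise" the definition describes.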

Perplexity: Loss Made Intuitive

Perplexity is the exponential of cross-entropy loss: 2^(loss) when loss is measured in bits (base-2 log), or e^(loss) when using the natural log. It represents "how many options the model is effectively choosing between at each token." A perplexity of 10 means the model is as uncertain as if it were picking randomly among 10 equally likely tokens. Lower perplexity = more confident and accurate predictions. It's the standard metric for comparing language models' raw text modeling ability.
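The "effective number of options" intuition can be checked directly: a model that guesses uniformly over N tokens should have perplexity exactly N. A minimal sketch, assuming loss is a mean negative log-likelihood in nats (natural log):

```python
import math

def perplexity(mean_nll):
    """Perplexity from mean negative log-likelihood in nats."""
    return math.exp(mean_nll)

# A uniform guess over 10 equally likely tokens has loss
# -ln(1/10) = ln(10) at every position...
uniform_loss = -math.log(1 / 10)

# ...so perplexity recovers the 10-way choice: ≈ 10
ppl = perplexity(uniform_loss)
```

If the loss were computed with log base 2 instead, you would use `2 ** mean_nll`; the two conventions give the same perplexity for the same model.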

Loss Isn't Everything

A lower loss doesn't always mean a better model for users. A model with slightly higher loss but better alignment (via RLHF/DPO) is usually more useful than a model with minimal loss but no alignment. Loss measures how well the model predicts text; alignment measures how well it follows instructions and avoids harm. The gap between "good at predicting text" and "good at being helpful" is what post-training addresses.
