Training

Loss Function

Objective Function, Cost Function
A mathematical function that measures how wrong a model's predictions are. The model's entire objective during training is to minimize this number. For LLMs, the loss function is typically cross-entropy: it measures how surprised the model is by the actual next token, given its predicted probability distribution. Lower loss means the model's predictions are closer to reality.

Why It Matters

The loss function is training's compass. Everything a model learns is in service of reducing this single number. Choosing the wrong loss function means the model optimizes for the wrong thing. Understanding loss helps you interpret training curves, diagnose problems (loss plateau? divergence? overfitting?), and understand why models behave the way they do.

Deep Dive

Cross-entropy loss for language models works like this: at each position in the text, the model predicts a probability distribution over its entire vocabulary. The loss is the negative log probability assigned to the actual next token. If the model predicted the correct token with 90% probability, loss is low (−log(0.9) ≈ 0.1). If it predicted the correct token with 1% probability, loss is high (−log(0.01) ≈ 4.6). Averaging across all positions gives the mean loss that is typically reported during training; summing gives the total loss.
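The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a real model: the probability lists below are made-up distributions standing in for a model's output at one position.

```python
import math

def token_cross_entropy(probs, target_index):
    # Negative log probability assigned to the actual next token.
    return -math.log(probs[target_index])

# Model is 90% confident in the correct token (index 0): low loss.
confident = token_cross_entropy([0.9, 0.05, 0.05], 0)   # ≈ 0.105

# Model gave the correct token only 1% probability: high loss.
surprised = token_cross_entropy([0.01, 0.50, 0.49], 0)  # ≈ 4.605

# Per-sequence loss is the mean over all positions.
per_position = [confident, surprised, 0.3]
mean_loss = sum(per_position) / len(per_position)
```

In practice frameworks compute this from raw logits with a fused log-softmax for numerical stability, but the quantity being minimized is the same.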

Perplexity: Loss Made Intuitive

Perplexity is 2^(loss) when the cross-entropy is measured in bits, or equivalently e^(loss) when using natural log (nats). It represents "how many options the model is effectively choosing between at each token." A perplexity of 10 means the model is as uncertain as if it were picking randomly among 10 equally likely tokens. Lower perplexity = more confident and accurate predictions. It's the standard metric for comparing language models' raw text modeling ability.
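The conversion is a one-liner. A minimal sketch, assuming the loss is a mean cross-entropy in nats (the usual convention when training with natural-log cross-entropy):

```python
import math

def perplexity(mean_loss_nats):
    # Exponentiating the mean cross-entropy (in nats) gives perplexity.
    return math.exp(mean_loss_nats)

# A mean loss of ln(10) ≈ 2.303 nats corresponds to perplexity 10:
# as uncertain as a uniform pick among 10 tokens.
ppl = perplexity(math.log(10))
```

Note that perplexity comparisons are only meaningful between models that share a tokenizer, since the loss is defined per token.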

Loss Isn't Everything

A lower loss doesn't always mean a better model for users. A model with slightly higher loss but better alignment (via RLHF/DPO) is usually more useful than a model with minimal loss but no alignment. Loss measures how well the model predicts text; alignment measures how well it follows instructions and avoids harm. The gap between "good at predicting text" and "good at being helpful" is what post-training addresses.
