Fundamentals

Perplexity (Metric)

PPL
A measurement of how well a language model predicts text. Technically, it is the exponential of the average cross-entropy loss. Intuitively, it represents how many tokens the model is "choosing between" at each step. A perplexity of 10 means the model is as uncertain as if it were randomly picking among 10 equally likely options. Lower perplexity means better predictions.
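The "choosing among N options" intuition can be checked with a few lines of arithmetic (a minimal sketch; the probabilities are hypothetical):

```python
import math

def perplexity(probs):
    # Perplexity = exp of the average negative log-probability
    # the model assigned to each actual token in the sequence.
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# A model that is always choosing uniformly among 10 options
# assigns probability 0.1 to each correct token:
print(perplexity([0.1] * 5))          # 10.0

# A confident model assigns high probability to each correct token:
print(perplexity([0.9, 0.8, 0.95]))   # ~1.13, close to the perfect score of 1
```

Uniform uncertainty over 10 options yields exactly a perplexity of 10, matching the intuition above.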

Why It Matters

Perplexity is the most fundamental metric for comparing the raw text-modeling ability of language models. It is computed on held-out text that the model never saw during training. When researchers say "we achieved lower perplexity on WikiText-103", they mean their model is better at predicting natural text. But perplexity alone does not tell you whether a model is helpful, safe, or good at following instructions; that is what benchmarks and human evaluation are for.

Deep Dive

The formula: PPL = exp(−(1/N) ∑ log P(token_i | context_i)), where N is the number of tokens and P is the model's predicted probability for each actual token. If the model assigns high probability to every correct token, the sum of log probabilities is close to zero, and PPL approaches 1 (perfect). If the model is surprised by many tokens, the sum is a large negative number, and PPL is high.
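The formula maps directly onto the per-token log-probabilities a model reports (a minimal sketch; the sequence and its probabilities are hypothetical):

```python
import math

def ppl_from_logprobs(logprobs):
    # PPL = exp(-(1/N) * sum(log P(token_i | context_i)))
    return math.exp(-sum(logprobs) / len(logprobs))

# Hypothetical log-probabilities for a 4-token sequence:
lp = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.1)]
print(ppl_from_logprobs(lp))               # ~3.16

# Perfect prediction: every log-probability is log(1) = 0,
# so the sum is zero and PPL = exp(0) = 1.
print(ppl_from_logprobs([0.0, 0.0, 0.0]))  # 1.0
```

The second call illustrates the limiting case described above: when every correct token gets probability 1, PPL is exactly 1.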

Comparing Perplexities

You can only meaningfully compare perplexities between models that use the same tokenizer and are evaluated on the same text. A model with a larger vocabulary packs more text into each token, so its per-token perplexity is not directly comparable to that of a model with finer-grained tokens. Evaluation datasets matter too: perplexity on Wikipedia (clean, well-structured text) will be much lower than perplexity on Reddit (noisy, informal). Always check what tokenizer and evaluation set were used.
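One common workaround when tokenizers differ is to renormalize perplexity to a tokenizer-independent unit such as characters: the total log-likelihood of the text is unchanged, only the divisor changes from tokens to characters. A sketch with hypothetical numbers:

```python
def per_char_ppl(token_ppl, num_tokens, num_chars):
    # PPL_char = PPL_token ** (num_tokens / num_chars):
    # the total log-likelihood stays the same; only the
    # per-unit normalizer changes from tokens to characters.
    return token_ppl ** (num_tokens / num_chars)

# The same 100-character text scored by two models with different
# tokenizers (all numbers hypothetical):
print(per_char_ppl(20.0, 25, 100))   # coarse tokenizer:       ~2.11 per char
print(per_char_ppl(12.0, 40, 100))   # fine-grained tokenizer: ~2.70 per char
# Despite its higher token-level perplexity, the first model is
# the better text predictor once normalized per character.
```

Note how the ranking flips after normalization, which is exactly why raw token-level numbers from different tokenizers should not be compared directly.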

The Gap Between PPL and Usefulness

A model can have excellent perplexity but be terrible as an assistant. Pre-trained base models (before RLHF/DPO) typically have lower perplexity than their aligned counterparts, because alignment training optimizes for helpfulness rather than raw prediction accuracy. The aligned model might assign lower probability to the statistically most likely next token if that token would produce an unhelpful or unsafe response. This is a feature, not a bug — but it means perplexity is a measure of text modeling, not utility.
