Basics

Perplexity (Metric)

PPL
A measure of how well a language model predicts text. Technically, it is the exponential of the average cross-entropy loss. Intuitively, it represents how many tokens the model is "choosing between" at each step. A perplexity of 10 means the model is as uncertain as if it were picking at random among 10 equally likely options. Lower perplexity means better prediction.
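The "choosing among k equally likely options" intuition can be checked directly. This is a minimal sketch (the function name `perplexity` is my own, not from the article): if a model assigns probability 1/10 to every actual token, its perplexity comes out to exactly 10.

```python
import math

def perplexity(log_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    return math.exp(-sum(log_probs) / len(log_probs))

# A model completely unsure among 10 equally likely options assigns
# each actual token probability 1/10, so its perplexity is 10.
uniform = [math.log(1 / 10)] * 5
print(perplexity(uniform))   # ≈ 10.0

# A model that assigns probability 1.0 to every token has perplexity 1.
print(perplexity([math.log(1.0)] * 3))   # 1.0
```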

Why It Matters

Perplexity is the most basic metric for comparing the raw text-modeling ability of language models. It is computed on held-out text that the model never saw during training. When researchers say "we achieved lower perplexity on WikiText-103," they mean their model is better at predicting natural text. But perplexity alone cannot tell you whether a model is useful, safe, or good at following instructions; that is what benchmarks and human evaluation are for.

Deep Dive

The formula: PPL = exp(−(1/N) ∑ log P(token_i | context_i)), where N is the number of tokens and P is the model's predicted probability for each actual token. If the model assigns high probability to every correct token, the sum of log probabilities is close to zero, and PPL approaches 1 (perfect). If the model is surprised by many tokens, the sum is a large negative number, and PPL is high.
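The formula above can be computed directly from a model's raw logits. This is a pure-Python sketch (the function `ppl_from_logits` and the toy logits are illustrative assumptions, not from any specific library): softmax each logit row via a numerically stable log-sum-exp, take the negative log-probability of each actual token, average, and exponentiate.

```python
import math

def ppl_from_logits(logit_rows, target_ids):
    """PPL = exp(mean cross-entropy), computed from raw logits.

    logit_rows[i] -- the model's logits over the vocabulary at step i
    target_ids[i] -- index of the token that actually came next
    """
    total_nll = 0.0
    for logits, tgt in zip(logit_rows, target_ids):
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total_nll += log_z - logits[tgt]  # -log P(target | context)
    return math.exp(total_nll / len(target_ids))

# Uniform logits over a 2-token vocab: P = 0.5 per step, so PPL = 2.
print(ppl_from_logits([[0.0, 0.0], [0.0, 0.0]], [0, 1]))  # 2.0

# A very confident correct prediction pushes PPL toward 1.
print(ppl_from_logits([[10.0, 0.0]], [0]))  # just above 1.0
```

In practice the same arithmetic is done with tensor ops (e.g. exponentiating the mean cross-entropy loss), but the per-token structure is identical.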

Comparing Perplexities

You can only meaningfully compare perplexities between models that use the same tokenizer and are evaluated on the same text. A model with a different vocabulary might report lower per-token perplexity simply because its tokenizer splits the text differently, changing how much content each prediction covers. Evaluation datasets matter too: perplexity on Wikipedia (clean, well-structured text) will be much lower than perplexity on Reddit (noisy, informal). Always check what tokenizer and evaluation set were used.
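One common way to compare models across different tokenizers is to normalize by the raw text length instead of the token count, reporting bits per byte (BPB). A sketch, with hypothetical numbers and a function name of my own:

```python
import math

def bits_per_byte(ppl, n_tokens, n_bytes):
    """Convert token-level perplexity into bits per byte of raw text.

    Total NLL in bits = n_tokens * log2(ppl); dividing by the byte
    length of the evaluated text removes the tokenizer from the score.
    """
    return n_tokens * math.log2(ppl) / n_bytes

# Hypothetical numbers: model A's coarser tokenizer splits the same
# text into fewer tokens, so its higher per-token PPL can still win
# on the tokenizer-agnostic BPB score.
text_bytes = 4000
print(bits_per_byte(12.0, 1000, text_bytes))  # model A: ~0.896 BPB
print(bits_per_byte(8.0, 1500, text_bytes))   # model B: 1.125 BPB
```

Here model A looks worse on raw perplexity (12 vs. 8) but better once tokenization is factored out, which is exactly the trap the paragraph above warns about.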

The Gap Between PPL and Usefulness

A model can have excellent perplexity but be terrible as an assistant. Pre-trained base models (before RLHF/DPO) typically have lower perplexity than their aligned counterparts, because alignment training optimizes for helpfulness rather than raw prediction accuracy. The aligned model might assign lower probability to the statistically most likely next token if that token would produce an unhelpful or unsafe response. This is a feature, not a bug — but it means perplexity is a measure of text modeling, not utility.
