BLEU & ROUGE

BLEU Score, ROUGE Score
Classic metrics that evaluate text-generation quality by comparing model output against reference texts. BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the generated text appear in the reference and was originally designed for machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how many n-grams in the reference appear in the generated text and was designed for summarization.

Why It Matters

BLEU and ROUGE have been standard NLP evaluation metrics for more than a decade and are still widely used. Understanding them, and their limitations, helps you assess claims in NLP research and see why the field is moving toward human evaluation and model-based evaluation. A high BLEU score does not guarantee quality, and a low BLEU score does not guarantee failure.

Deep Dive

BLEU computes precision: what fraction of n-grams (1-grams, 2-grams, 3-grams, 4-grams) in the generated text also appear in the reference? ROUGE computes recall: what fraction of n-grams in the reference also appear in the generated text? BLEU penalizes outputs that are too short (brevity penalty). ROUGE-L uses longest common subsequence instead of fixed n-grams, capturing word order more flexibly.
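
The arithmetic behind both metrics fits in a short script. Below is a minimal, sentence-level sketch with toy helper names (not the official sacrebleu or rouge-score implementations): clipped n-gram precision combined with a brevity penalty for BLEU, n-gram recall for ROUGE-N, and an LCS-based ROUGE-L.

```python
# Toy, sentence-level sketch of BLEU and ROUGE (real BLEU is usually
# corpus-level and smoothed; helper names here are illustrative only).
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        overlap = sum(min(count, ref[g]) for g, count in cand.items())  # clipped counts
        total = max(sum(cand.values()), 1)
        precisions.append(max(overlap, 1e-9) / total)  # avoid log(0)
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * geo_mean

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: recall of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, cand[g]) for g, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

def rouge_l(candidate, reference):
    """ROUGE-L (recall form): longest common subsequence over reference length."""
    m, k = len(candidate), len(reference)
    dp = [[0] * (k + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(k):
            dp[i + 1][j + 1] = dp[i][j] + 1 if candidate[i] == reference[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][k] / max(k, 1)

ref = "the cat sat on the mat".split()
hyp = "the cat is on the mat".split()
print(bleu(hyp, ref), rouge_n(hyp, ref, 1), rouge_l(hyp, ref))
```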

Why They're Flawed

Both metrics reward surface-level similarity to references. A perfect paraphrase scores poorly (different words, same meaning). A repetitive, nonsensical text that happens to reuse reference n-grams can score well. They also require reference texts, which limits them to tasks where "correct" answers exist. For open-ended generation (creative writing, conversation), there's no single correct reference to compare against.
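
To make the paraphrase problem concrete, here is a small illustration using NLTK's sentence-level BLEU (the example sentences are invented, and nltk is assumed to be installed): a near-copy of the reference scores well, while a faithful paraphrase with the same meaning scores close to zero.

```python
# Requires nltk. A paraphrase with little word overlap gets a far lower BLEU
# than a near-copy, even though both convey the reference's meaning.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1  # avoid hard zeros when higher-order n-grams are absent
ref = "the committee approved the proposal yesterday".split()
copy = "the committee approved the proposal".split()
para = "the panel gave its approval to the plan a day ago".split()

print(sentence_bleu([ref], copy, smoothing_function=smooth))  # high: heavy n-gram overlap
print(sentence_bleu([ref], para, smoothing_function=smooth))  # near zero: same meaning, few shared n-grams
```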

Modern Alternatives

The field has moved toward: BERTScore (uses embedding similarity instead of n-gram matching, captures paraphrase better), model-based evaluation (using an LLM to judge output quality), and human evaluation (the gold standard but expensive). For LLM evaluation specifically, benchmarks like MMLU, HumanEval, and Chatbot Arena have replaced BLEU/ROUGE as the primary comparison metrics. But BLEU and ROUGE remain useful for translation and summarization where reference comparison makes sense.
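
The core idea behind BERTScore can be sketched in a few lines. The example below is a rough illustration only, using made-up 2D vectors in place of real contextual BERT embeddings: each candidate token is greedily matched to its most similar reference token (precision), each reference token to its most similar candidate token (recall), and the two are combined into an F1.

```python
# BERTScore-style greedy matching, sketched with toy vectors. The actual
# bert-score package handles tokenization, model inference, and IDF weighting.
import numpy as np

def greedy_f1(cand_emb: np.ndarray, ref_emb: np.ndarray) -> float:
    """cand_emb: (num_cand_tokens, dim), ref_emb: (num_ref_tokens, dim)."""
    # Cosine similarity between every candidate token and every reference token.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T
    precision = sim.max(axis=1).mean()  # each candidate token -> best reference token
    recall = sim.max(axis=0).mean()     # each reference token -> best candidate token
    return 2 * precision * recall / (precision + recall)

# Toy embeddings: synonyms point in similar directions, so a paraphrase can
# still score highly, unlike with n-gram overlap.
ref_emb = np.array([[1.0, 0.1], [0.2, 1.0], [0.9, 0.4]])
cand_emb = np.array([[0.95, 0.15], [0.25, 0.95], [0.85, 0.45]])
print(greedy_f1(cand_emb, ref_emb))
```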
