Basics

BLEU & ROUGE

BLEU Score, ROUGE Score
Classic metrics for evaluating text-generation quality by comparing model output against reference text. BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the generated text appear in the reference; it was originally designed for machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how many n-grams in the reference appear in the generated text; it was designed for summarization.

Why It Matters

BLEU and ROUGE were NLP's standard evaluation metrics for more than a decade and remain widely used today. Understanding them, and their limitations, helps you assess claims in NLP research and see why the field is moving toward human evaluation and model-based evaluation. A high BLEU score does not guarantee quality, and a low BLEU score does not guarantee failure.

Deep Dive

BLEU computes precision: what fraction of n-grams (1-grams, 2-grams, 3-grams, 4-grams) in the generated text also appear in the reference? ROUGE computes recall: what fraction of n-grams in the reference also appear in the generated text? BLEU penalizes outputs that are too short (brevity penalty). ROUGE-L uses longest common subsequence instead of fixed n-grams, capturing word order more flexibly.
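A minimal Python sketch of this arithmetic follows; the function names (ngrams, clipped_precision, bleu, rouge_n) and the example sentences are illustrative, not a library API, and reference implementations such as sacrebleu and rouge-score additionally handle smoothing, tokenization, and multiple references.

```python
# Minimal sketch of BLEU-style clipped precision and ROUGE-N recall.
from collections import Counter
import math

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(candidate, reference, n):
    """BLEU-style modified precision: candidate n-gram counts clipped by reference counts."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

def bleu(candidate, reference, max_n=4):
    """Geometric mean of 1..max_n clipped precisions, times the brevity penalty."""
    precisions = [clipped_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:  # log(0) is undefined; real implementations smooth instead
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    bp = 1.0 if len(candidate) > len(reference) else math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(log_avg)

def rouge_n(candidate, reference, n=1):
    """ROUGE-N recall: fraction of reference n-grams recovered by the candidate."""
    cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / max(sum(ref.values()), 1)

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
print(f"BLEU-4  ≈ {bleu(candidate, reference):.3f}")
print(f"ROUGE-1 ≈ {rouge_n(candidate, reference, 1):.3f}")
```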

Why They're Flawed

Both metrics reward surface-level similarity to references. A perfect paraphrase scores poorly (different words, same meaning). A repetitive, nonsensical text that happens to reuse reference n-grams can score well. They also require reference texts, which limits them to tasks where "correct" answers exist. For open-ended generation (creative writing, conversation), there's no single correct reference to compare against.
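To make the paraphrase problem concrete, here is a hedged demo assuming NLTK's sentence_bleu; the sentences and the approximate scores in the comments are illustrative. A meaning-preserving paraphrase shares few n-grams with the reference, while a degenerate string that merely recycles reference words scores higher.

```python
# Hedged demo of the paraphrase problem, assuming NLTK's sentence_bleu API.
from nltk.translate.bleu_score import sentence_bleu

reference  = [["the", "cat", "sat", "on", "the", "mat"]]
paraphrase = ["a", "feline", "rested", "on", "the", "rug"]    # same meaning, few shared n-grams
copycat    = ["the", "cat", "the", "cat", "sat", "sat"]       # nonsense, but recycles reference words

# BLEU-2 (unigram + bigram precision); the nonsense string wins.
print(sentence_bleu(reference, paraphrase, weights=(0.5, 0.5)))   # roughly 0.26
print(sentence_bleu(reference, copycat,    weights=(0.5, 0.5)))   # roughly 0.52
```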

Modern Alternatives

The field has moved toward: BERTScore (uses embedding similarity instead of n-gram matching, captures paraphrase better), model-based evaluation (using an LLM to judge output quality), and human evaluation (the gold standard but expensive). For LLM evaluation specifically, benchmarks like MMLU, HumanEval, and Chatbot Arena have replaced BLEU/ROUGE as the primary comparison metrics. But BLEU and ROUGE remain useful for translation and summarization where reference comparison makes sense.
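As an illustration of the embedding-based direction, here is a hedged usage sketch assuming the third-party bert-score package; the example sentences and the lang argument are assumptions, and defaults vary by version.

```python
# Hedged sketch assuming the bert-score package (pip install bert-score).
from bert_score import score

candidates = ["A feline rested on the rug."]
references = ["The cat sat on the mat."]

# BERTScore matches token-level contextual embeddings, so a paraphrase that
# would get near-zero BLEU can still receive a reasonably high F1.
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1 = {F1.item():.3f}")
```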
