BLEU & ROUGE: Definition & Meaning — AI Wiki

BLEU and ROUGE are classic metrics for evaluating text generation quality by comparing model output to reference texts. BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the generated text appear in the reference; it was originally designed for machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how many n-grams from the reference appear in the generated text; it was designed for summarization.

Why it matters

BLEU and ROUGE were the standard automatic evaluation metrics for machine translation and summarization for over a decade and are still widely used. Understanding them, and their limitations, helps you evaluate NLP research claims and see why the field is moving toward human evaluation and model-based evaluation. A high BLEU score doesn't guarantee quality; a low BLEU score doesn't guarantee failure.

Deep Dive

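BLEU computes precision: what fraction of the n-grams (1-grams through 4-grams) in the generated text also appear in the reference? Matches are clipped, so an n-gram repeated in the output counts at most as many times as it occurs in the reference, and the four precisions are combined with a geometric mean. Because precision alone favors short outputs, BLEU also applies a brevity penalty when the output is shorter than the reference. ROUGE-N computes recall: what fraction of the n-grams in the reference also appear in the generated text? ROUGE-L uses the longest common subsequence instead of fixed n-grams, which rewards words that appear in the same order without requiring them to be adjacent.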
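To make the precision/recall distinction concrete, here is a minimal sketch in plain Python. It assumes whitespace tokenization, a single reference, and no smoothing, so it is an illustration of the definitions rather than a replacement for an established implementation. All function names (ngrams, clipped_precision, bleu, rouge_n_recall, lcs_length, rouge_l_recall) are chosen for this example.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, with each
    n-gram's count clipped to its count in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference (assumption: no smoothing)."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # the geometric mean is zero if any order has no match
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


def rouge_n_recall(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matched / total


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_recall(candidate, reference):
    """LCS length divided by reference length (recall side of ROUGE-L)."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0


reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()

print(round(bleu(candidate, reference), 3))                  # 0.537
print(round(rouge_n_recall(candidate, reference, 1), 3))     # 0.833
print(round(rouge_l_recall(candidate, reference), 3))        # 0.833
print(clipped_precision("the the the the".split(), reference, 1))  # 0.5
```

The last line shows why clipping matters: without it, an output consisting only of "the" would get a unigram precision of 1.0. For reported results, an established toolkit is the safer choice, since tokenization and smoothing choices change the numbers.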

Why They're Flawed

Both metrics reward surface-level similarity to references. A perfect paraphrase scores poorly (different words, same meaning). A repetitive, nonsensical text that happens to reuse reference n-grams can score well. They also require reference texts, which limits them to tasks where "correct" answers exist. For open-ended generation (creative writing, conversation), there's no single correct reference to compare against.
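A small sketch can illustrate the paraphrase problem. It uses only unigram overlap (roughly the quantity behind BLEU-1 and ROUGE-1), whitespace tokenization, and made-up example sentences; the function name unigram_overlap is an assumption for this example. Higher-order n-grams would penalize the scrambled sentence, but they would not rescue the paraphrase.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) based on clipped unigram matches."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    if not cand or not ref:
        return 0.0, 0.0
    matched = sum((cand & ref).values())  # Counter intersection clips counts
    return matched / sum(cand.values()), matched / sum(ref.values())


# Hypothetical sentences, invented for illustration only.
reference = "the company reported higher profits this quarter"
paraphrase = "earnings rose at the firm over the last three months"
scrambled = "quarter this profits higher reported company the"

print(unigram_overlap(paraphrase, reference))  # (0.1, 0.14285714285714285)
print(unigram_overlap(scrambled, reference))   # (1.0, 1.0)
```

The paraphrase conveys the same meaning but shares only the word "the" with the reference, while the scrambled word list gets a perfect unigram score despite being unreadable.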

Modern Alternatives

The field has moved toward three alternatives: BERTScore (which uses embedding similarity instead of n-gram matching and so captures paraphrase better), model-based evaluation (using an LLM to judge output quality), and human evaluation (the gold standard, but expensive). For LLM evaluation specifically, benchmarks like MMLU, HumanEval, and Chatbot Arena have replaced BLEU/ROUGE scores as the primary basis for comparison. BLEU and ROUGE remain useful for translation and summarization, where reference comparison makes sense.
