BLEU & ROUGE: Definition & Meaning — AI Wiki

Classic metrics for evaluating the quality of generated text by comparing model output against reference texts. BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the generated text appear in the reference; it was originally designed for machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how many n-grams from the reference appear in the generated text; it was designed for summarization.

Why It Matters

BLEU and ROUGE were the standard evaluation metrics in NLP for more than a decade and are still widely used. Understanding them, and their limitations, helps you assess claims in NLP research and see why the field is moving toward human and model-based evaluation. A high BLEU score does not guarantee quality; a low BLEU score does not guarantee failure.

Deep Dive

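BLEU computes precision: what fraction of n-grams (1-grams through 4-grams) in the generated text also appear in the reference? Matches are clipped, so an n-gram counts at most as many times as it occurs in the reference, and the per-order precisions are combined as a geometric mean. ROUGE-N computes recall: what fraction of n-grams in the reference also appear in the generated text? Because precision alone would favor very short outputs, BLEU multiplies the score by a brevity penalty when the output is shorter than the reference. ROUGE-L uses the longest common subsequence instead of fixed n-grams, so it rewards words that appear in the same order even when they are not adjacent.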
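The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace tokenization, a single reference, and no smoothing; the function names are hypothetical and this is not the implementation used by any particular toolkit, which typically handles multiple references, corpus-level aggregation, and smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Return (clipped matches, candidate n-gram total, reference n-gram total)."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count per n-gram, i.e. clipping.
    matched = sum((cand & ref).values())
    return matched, sum(cand.values()), sum(ref.values())


def ngram_precision(candidate, reference, n):
    """BLEU-style: share of candidate n-grams found in the reference."""
    matched, cand_total, _ = overlap(candidate, reference, n)
    return matched / cand_total if cand_total else 0.0


def rouge_n_recall(candidate, reference, n):
    """ROUGE-N-style: share of reference n-grams found in the candidate."""
    matched, _, ref_total = overlap(candidate, reference, n)
    return matched / ref_total if ref_total else 0.0


def brevity_penalty(candidate, reference):
    """1.0 if the candidate is at least as long as the reference, else exp(1 - r/c)."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)


def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1..max_n precisions (unsmoothed)."""
    precisions = [ngram_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, one zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)


def lcs_length(a, b):
    """Length of the longest common subsequence, the quantity behind ROUGE-L."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


reference = "the cat sat on the mat".split()
candidate = "the cat sat on the".split()

print(ngram_precision(candidate, reference, 1))  # every candidate word is in the reference
print(rouge_n_recall(candidate, reference, 1))   # "mat" is missing from the candidate
print(bleu(candidate, reference))                # perfect precisions, reduced by the brevity penalty
print(lcs_length(candidate, reference))
```

In this toy pair the candidate is a truncated copy of the reference, so every n-gram precision is 1.0 and the only thing lowering BLEU is the brevity penalty, while ROUGE-1 recall drops because one reference word is never produced.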

Why They're Flawed

Both metrics reward surface-level similarity to references. A perfect paraphrase scores poorly (different words, same meaning). A repetitive, nonsensical text that happens to reuse reference n-grams can score well. They also require reference texts, which limits them to tasks where "correct" answers exist. For open-ended generation (creative writing, conversation), there's no single correct reference to compare against.
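A toy example makes the surface-similarity problem concrete. The sketch below assumes whitespace tokenization and looks only at clipped unigram precision; the sentences and the `unigram_precision` helper are made up for illustration. Higher-order n-grams would penalize the scrambled sentence somewhat, but they would not rescue the paraphrase.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped share of candidate words that also occur in the reference."""
    cand, ref = Counter(candidate), Counter(reference)
    matched = sum((cand & ref).values())  # intersection clips repeated words
    return matched / len(candidate) if candidate else 0.0


reference = "the cat sat on the mat".split()
paraphrase = "a feline rested upon a rug".split()  # same meaning, different words
scrambled = "mat the on sat cat the".split()       # same words, no meaning

print(unigram_precision(paraphrase, reference))  # 0.0
print(unigram_precision(scrambled, reference))   # 1.0
```

The faithful paraphrase shares no words with the reference and scores zero, while the word salad reuses every reference word and scores perfectly at the unigram level.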

Modern Alternatives

The field has moved toward three alternatives: BERTScore (which uses embedding similarity instead of n-gram matching and so captures paraphrase better), model-based evaluation (using an LLM to judge output quality), and human evaluation (the gold standard, but expensive). For LLM evaluation specifically, benchmarks such as MMLU, HumanEval, and Chatbot Arena have largely replaced BLEU/ROUGE scores as the primary basis for comparing models. BLEU and ROUGE remain useful for translation and summarization, where comparison against a reference makes sense.
