BLEU & ROUGE: Definition & Meaning — AI Wiki

Classic metrics for evaluating the quality of generated text by comparing model output against reference texts. BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the generated text appear in the reference; it was originally designed for machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how many n-grams from the reference appear in the generated text; it was designed for summarization.

Why It Matters

BLEU and ROUGE were the standard evaluation metrics in NLP for more than a decade and are still widely used. Understanding them, and their limitations, helps you assess claims in NLP research and see why the field is moving toward human and model-based evaluation. A high BLEU score does not guarantee quality; a low BLEU score does not guarantee failure.

Deep Dive

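BLEU is precision-oriented: for each n-gram order (typically 1-grams through 4-grams), what fraction of n-grams in the generated text also appear in the reference? Counts are clipped, so a repeated n-gram cannot match more often than it occurs in the reference, and the per-order precisions are combined with a geometric mean. Because precision alone would favor very short outputs, BLEU also applies a brevity penalty when the output is shorter than the reference. ROUGE-N is recall-oriented: what fraction of n-grams in the reference also appear in the generated text? ROUGE-L uses the longest common subsequence instead of fixed n-grams, so it credits words that appear in the same order even when they are not adjacent.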
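The definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming whitespace-tokenized input, a single reference, and no smoothing; the function names are made up for this example and do not correspond to any particular library, and production toolkits differ in tokenization, smoothing, and multi-reference handling.

```python
from collections import Counter
import math

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_precision(candidate, reference, n):
    """BLEU-style precision: matched candidate n-grams / all candidate n-grams.
    Matches are clipped to the count of that n-gram in the reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum(min(c, ref[g]) for g, c in cand.items()) / total

def rouge_n_recall(candidate, reference, n):
    """ROUGE-N recall: matched reference n-grams / all reference n-grams."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    return sum(min(c, cand[g]) for g, c in ref.items()) / total

def brevity_penalty(candidate, reference):
    """1.0 when the candidate is at least as long as the reference,
    otherwise exp(1 - r/c), which shrinks as the candidate gets shorter."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU: brevity penalty times the
    geometric mean of the 1..max_n clipped precisions."""
    precisions = [clipped_precision(candidate, reference, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # without smoothing, any zero precision zeroes the score
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """ROUGE-L recall: LCS length / reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat sat on a mat".split()
print(clipped_precision(candidate, reference, 1))  # unigram precision
print(rouge_n_recall(candidate, reference, 1))     # ROUGE-1 recall
print(rouge_l_recall(candidate, reference))        # ROUGE-L recall
print(bleu(candidate, reference))                  # BLEU-4, no smoothing
```

Note that the unsmoothed score drops to zero whenever any n-gram order has no match, which is one reason BLEU is usually reported over a whole corpus rather than per sentence.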

Why They're Flawed

Both metrics reward surface-level similarity to references. A perfect paraphrase can score poorly because it uses different words to express the same meaning. A disfluent or nonsensical text that happens to reuse reference n-grams can score well. Both metrics also require reference texts, which limits them to tasks where "correct" answers exist. For open-ended generation (creative writing, conversation), there's no single correct reference to compare against.
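A toy example makes the surface-similarity problem concrete. The sketch below uses a simple unigram-overlap F1 (a stand-in for ROUGE-1, written for this illustration rather than taken from any library) and made-up sentences: a faithful paraphrase shares almost no words with the reference, while a scrambled version of the reference shares all of them.

```python
from collections import Counter

def unigram_f1(candidate, reference):
    """Harmonic mean of unigram precision and recall over whitespace tokens."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

reference = "the quick brown fox jumps over the lazy dog"
paraphrase = "a fast dark fox leaps above a sleepy hound"   # same meaning
word_salad = "dog lazy the over jumps the fox quick brown"  # same words, no meaning

print(unigram_f1(paraphrase, reference))  # low: only "fox" is shared
print(unigram_f1(word_salad, reference))  # perfect unigram overlap
```

Higher-order n-grams and ROUGE-L do penalize the scrambled sentence, since they are sensitive to word order, but nothing in either metric rescues the paraphrase: it is scored low for word choice alone.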

Modern Alternatives

The field has moved toward: BERTScore (uses contextual embedding similarity instead of exact n-gram matching, so it captures paraphrase better), model-based evaluation (using an LLM to judge output quality), and human evaluation (the gold standard but expensive). For LLM evaluation specifically, benchmarks like MMLU, HumanEval, and Chatbot Arena have replaced BLEU/ROUGE scores as the primary basis for comparing models. But BLEU and ROUGE remain useful for translation and summarization, where comparison against a reference makes sense.
