Fundamentals

BLEU & ROUGE

BLEU Score, ROUGE Score
Classic metrics for evaluating text generation quality by comparing model output against reference texts. BLEU (Bilingual Evaluation Understudy) measures how many n-grams in the generated text appear in the reference; it was originally designed for machine translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how many n-grams from the reference appear in the generated text; it was designed for summarization.

Why It Matters

BLEU and ROUGE were the standard evaluation metrics in NLP for more than a decade and are still widely used. Understanding them, including their limitations, helps you evaluate NLP research claims and see why the field is moving toward human evaluation and model-based evaluation. A high BLEU score does not guarantee quality; a low BLEU score does not guarantee failure.

Deep Dive

BLEU computes precision: what fraction of n-grams (1-grams, 2-grams, 3-grams, 4-grams) in the generated text also appear in the reference? ROUGE computes recall: what fraction of n-grams in the reference also appear in the generated text? BLEU penalizes outputs that are too short (brevity penalty). ROUGE-L uses longest common subsequence instead of fixed n-grams, capturing word order more flexibly.
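A minimal, self-contained sketch of the two ideas using plain n-gram counting (no external libraries; the example sentences are invented for illustration):

```python
# Sketch of BLEU-style clipped n-gram precision (with brevity penalty)
# and ROUGE-N style recall, using whitespace tokenization.
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Geometric mean of clipped n-gram precisions times a brevity penalty."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        # Clip each candidate n-gram count at its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # unsmoothed sentence-level BLEU is 0 if any order has no match
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

def rouge_n(candidate, reference, n=1):
    """ROUGE-N: fraction of reference n-grams that also appear in the candidate."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)

reference = "the cat sat on the mat"
candidate = "the cat is on the mat"
print("BLEU-2 :", round(bleu(candidate, reference, max_n=2), 3))  # ~0.707
print("ROUGE-1:", round(rouge_n(candidate, reference, 1), 3))     # ~0.833
```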

Why They're Flawed

Both metrics reward surface-level similarity to references. A perfect paraphrase scores poorly (different words, same meaning). A repetitive, nonsensical text that happens to reuse reference n-grams can score well. They also require reference texts, which limits them to tasks where "correct" answers exist. For open-ended generation (creative writing, conversation), there's no single correct reference to compare against.
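To make the paraphrase problem concrete, here is a tiny illustration using the same unigram-recall idea (the sentences are made up for this example):

```python
# A faithful paraphrase with little lexical overlap scores worse than a
# degenerate, copy-like output that just recycles reference words.
from collections import Counter

def unigram_recall(candidate, reference):
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, cand[word]) for word, count in ref.items())
    return overlap / sum(ref.values())

reference  = "the quick brown fox jumps over the lazy dog"
paraphrase = "a fast auburn fox leaps above the idle hound"  # same meaning, new words
copy_like  = "the quick brown fox the quick brown fox"       # repetitive nonsense

print(round(unigram_recall(paraphrase, reference), 3))  # ~0.222
print(round(unigram_recall(copy_like, reference), 3))   # ~0.556
```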

Modern Alternatives

The field has moved toward: BERTScore (uses embedding similarity instead of n-gram matching, so it captures paraphrases better), model-based evaluation (using an LLM to judge output quality), and human evaluation (the gold standard, but expensive). For LLM evaluation specifically, benchmarks like MMLU, HumanEval, and Chatbot Arena have replaced BLEU/ROUGE as the primary comparison metrics. But BLEU and ROUGE remain useful for translation and summarization, where reference comparison makes sense.
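A hedged sketch of the embedding-based approach, assuming the third-party bert-score package; the sentences and printed value are illustrative, not from this article:

```python
# Assumes `pip install bert-score`; downloads a pretrained model on first run.
from bert_score import score

candidates = ["the feline rested on the rug"]  # paraphrase of the reference
references = ["the cat sat on the mat"]

# BERTScore matches tokens by contextual-embedding similarity, so the
# paraphrase scores much higher than its n-gram overlap would suggest.
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1[0].item():.3f}")
```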
