Core Concepts

Human Evaluation

Human Eval, Manual Evaluation
Evaluating AI output quality by having humans judge it directly. Humans assess fluency, accuracy, helpfulness, safety, and whether the output actually meets the request. Despite being expensive and slow, human evaluation remains the gold standard because automated metrics often miss what actually matters to users.

Why It Matters

Every automated metric is a proxy for human judgment, and every proxy has blind spots. BLEU cannot detect factual errors. Perplexity cannot measure helpfulness. Even LLM-as-judge approaches inherit biases (for example, preferring verbose responses). When the stakes are high, such as launching a product, comparing model versions, or evaluating safety, human evaluation is irreplaceable.

Deep Dive

Human evaluation comes in several flavors: absolute rating (score this response 1–5 on helpfulness), pairwise comparison (which of these two responses is better?), and task-specific evaluation (did the model correctly extract all entities from this document?). Pairwise comparison is generally more reliable than absolute rating because humans are better at comparing than scoring — this is why Chatbot Arena uses pairwise voting.
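
To make the pairwise approach concrete, here is a minimal sketch of turning pairwise human votes into model rankings with an Elo-style update, similar in spirit to how Chatbot Arena ranks models (this is not Arena's actual implementation; the vote data and K-factor are illustrative assumptions):

```python
# Minimal sketch: aggregate pairwise human votes into Elo-style ratings.
from collections import defaultdict

K = 32  # update step size (assumed; real systems tune this)

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_ratings(votes, initial=1000.0):
    """votes: iterable of (model_a, model_b, winner) with winner in {'a', 'b'}."""
    ratings = defaultdict(lambda: initial)
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[a] += K * (s_a - e_a)
        ratings[b] += K * ((1.0 - s_a) - (1.0 - e_a))
    return dict(ratings)

votes = [("model-x", "model-y", "a"),
         ("model-y", "model-x", "b"),
         ("model-x", "model-z", "a")]
print(update_ratings(votes))
```

A Bradley-Terry model, which fits all votes jointly instead of sequentially, is a common alternative to the online Elo update shown here.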

The Cost Problem

Human evaluation is expensive: skilled annotators, clear guidelines, quality control, and statistical significance all take time and money. Evaluating a model across diverse tasks can require thousands of human judgments. This is why automated metrics exist; they are cheap and nearly instant. The practical approach is to use automated metrics for rapid iteration during development and human evaluation for milestone decisions (releases, A/B tests, safety audits).
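
Quality control usually starts with checking whether annotators agree with each other. Below is a minimal sketch of Cohen's kappa between two annotators; the label names and data are illustrative assumptions:

```python
# Minimal sketch: inter-annotator agreement via Cohen's kappa.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled at random
    # with their own marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["helpful", "helpful", "unhelpful", "helpful", "unhelpful"]
b = ["helpful", "unhelpful", "unhelpful", "helpful", "unhelpful"]
print(f"kappa = {cohens_kappa(a, b):.2f}")  # 0.62 here; 1.0 is perfect agreement
```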

LLM-as-Judge

A middle ground: use a strong LLM to evaluate a weaker model's outputs. This is cheaper than human evaluation and often correlates well with human judgments. But it has known biases: LLM judges tend to prefer longer responses, more formatted responses, and responses that match their own style. Using multiple judge models and calibrating against human ratings helps, but LLM-as-judge should complement, not replace, human evaluation for important decisions.
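
A hedged sketch of the pairwise judging pattern follows. `call_judge` is a hypothetical placeholder for whatever judge-model API you use, and the position swap counters the judge's known bias toward whichever answer is shown first:

```python
# Sketch: pairwise LLM-as-judge with a position swap for debiasing.
# `call_judge` is a hypothetical stand-in for your judge model's API call.
JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is better? Reply with exactly 'A' or 'B'."""

def call_judge(prompt: str) -> str:
    raise NotImplementedError("Replace with your judge model's API call.")

def judge_pair(question: str, out_1: str, out_2: str) -> str:
    """Return 'model_1', 'model_2', or 'tie' if the verdict flips with order."""
    first = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=out_1, answer_b=out_2))
    second = call_judge(JUDGE_PROMPT.format(
        question=question, answer_a=out_2, answer_b=out_1))
    # Only trust a verdict that survives swapping the presentation order.
    if first.strip() == "A" and second.strip() == "B":
        return "model_1"
    if first.strip() == "B" and second.strip() == "A":
        return "model_2"
    return "tie"
```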
