Fundamentals

Human Evaluation

Human Eval, Manual Evaluation
Evaluating the quality of AI output by having humans judge it directly. Humans assess fluency, accuracy, helpfulness, safety, and whether the output actually fulfills the request. Despite being expensive and slow, human evaluation remains the gold standard because automated metrics often miss what actually matters to users.

Why It Matters

Every automated metric is a proxy for human judgment, and every proxy has blind spots. BLEU cannot detect factual errors. Perplexity cannot measure helpfulness. Even LLM-as-judge approaches inherit biases (preferring verbose responses, for example). When the stakes are high (launching a product, comparing model versions, auditing safety), human evaluation is irreplaceable.

Deep Dive

Human evaluation comes in several flavors: absolute rating (score this response 1–5 on helpfulness), pairwise comparison (which of these two responses is better?), and task-specific evaluation (did the model correctly extract all entities from this document?). Pairwise comparison is generally more reliable than absolute rating because humans are better at comparing than scoring — this is why Chatbot Arena uses pairwise voting.
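Pairwise votes are typically aggregated into a ranking; Chatbot Arena, for instance, converts votes into ratings. The sketch below shows one common approach, a simple online Elo update over a stream of pairwise judgments (model names and vote data are illustrative; Arena's production pipeline uses a more involved statistical model):

```python
from collections import defaultdict

def elo_ratings(battles, k=32, base=1500):
    """Turn a stream of pairwise human votes into Elo ratings.

    battles: list of (winner, loser) model-name pairs,
    processed in the order the votes arrived.
    """
    ratings = defaultdict(lambda: base)
    for winner, loser in battles:
        # Expected score of the winner under current ratings
        expected = 1 / (1 + 10 ** ((ratings[loser] - ratings[winner]) / 400))
        # Winner gains, loser loses, in proportion to how surprising the result was
        delta = k * (1 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)

# Hypothetical votes: model-a preferred twice, model-b once
votes = [("model-a", "model-b"), ("model-a", "model-b"), ("model-b", "model-a")]
print(elo_ratings(votes))
```

Because each update is zero-sum, the ratings only encode relative preference, which matches what pairwise voting actually measures.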

The Cost Problem

Human evaluation is expensive: skilled annotators, clear guidelines, quality control, and statistical significance require time and money. Evaluating a model across diverse tasks might need thousands of human judgments. This is why automated metrics exist — they're free and instant. The practical approach is to use automated metrics for rapid iteration during development and human evaluation for milestone decisions (release, A/B testing, safety audits).
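To make the "thousands of judgments" point concrete, here is a rough sample-size sketch using the standard normal approximation for a two-proportion test: how many pairwise votes you need to show that one model's preference rate genuinely exceeds 50/50 (the significance and power values are conventional defaults, not from the original text):

```python
import math

def judgments_needed(p_win, z_alpha=1.96, z_beta=0.84):
    """Approximate pairwise judgments needed to show a true preference
    rate p_win beats chance (0.5), at alpha=0.05 two-sided, power=0.8.

    Uses the textbook normal approximation to the binomial test.
    """
    p0 = 0.5
    numerator = (z_alpha * math.sqrt(p0 * (1 - p0)) +
                 z_beta * math.sqrt(p_win * (1 - p_win))) ** 2
    return math.ceil(numerator / (p_win - p0) ** 2)

# A 55% preference rate needs several hundred votes; a 60% rate needs far fewer
print(judgments_needed(0.55))
print(judgments_needed(0.60))
```

The takeaway: the closer two models are in quality, the more human judgments (and money) a decisive comparison costs, which is exactly why automated metrics handle day-to-day iteration.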

LLM-as-Judge

A middle ground: use a strong LLM to evaluate a weaker model's outputs. This is cheaper than human evaluation and often correlates well with human judgments. But it has known biases: LLM judges tend to prefer longer responses, more formatted responses, and responses that match their own style. Using multiple judge models and calibrating against human ratings helps, but LLM-as-judge should complement, not replace, human evaluation for important decisions.
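One widely used mitigation for position bias in pairwise LLM judging is to query the judge twice with the answer order swapped and only accept verdicts that agree. A minimal sketch, where `judge` stands in for a real LLM API call (the callable signature and return values are assumptions for illustration):

```python
def debiased_pairwise_verdict(judge, prompt, answer_a, answer_b):
    """Ask a judge to compare two answers in both orders.

    judge: callable (prompt, first, second) -> "first" or "second";
    in practice this would wrap an LLM API call.
    Returns "A", "B", or "tie" when the verdict flips with order.
    """
    v1 = judge(prompt, answer_a, answer_b)  # A shown first
    v2 = judge(prompt, answer_b, answer_a)  # B shown first
    a_wins = (v1 == "first") + (v2 == "second")
    if a_wins == 2:
        return "A"
    if a_wins == 0:
        return "B"
    # Verdict depended on presentation order: position bias, score as a tie
    return "tie"
```

A judge that always prefers whichever answer appears first produces a tie under this scheme, so order-driven wins are filtered out rather than counted.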
