Fundamentals

Human Evaluation

Also known as: Human Eval, Manual Evaluation
Evaluation in which humans directly judge the quality of AI outputs. Human raters assess fluency, accuracy, helpfulness, safety, and whether the output actually satisfies the request. Although expensive and slow, human evaluation remains the gold standard, because automated metrics often miss what truly matters to users.

Why It Matters

Every automated metric is a proxy for human judgment, and every proxy has blind spots. BLEU cannot detect factual errors. Perplexity cannot measure helpfulness. Even LLM-as-judge approaches inherit biases (such as a preference for verbose responses). When the stakes are high, such as shipping a product, comparing model versions, or auditing safety, human evaluation is irreplaceable.

Deep Dive

Human evaluation comes in several flavors: absolute rating (score this response 1–5 on helpfulness), pairwise comparison (which of these two responses is better?), and task-specific evaluation (did the model correctly extract all entities from this document?). Pairwise comparison is generally more reliable than absolute rating because humans are better at comparing than scoring — this is why Chatbot Arena uses pairwise voting.
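To make the pairwise approach concrete, here is a minimal Python sketch that turns a stream of pairwise human votes into per-model ratings with a simple online Elo update. The vote data, model names, and constants are illustrative, and Chatbot Arena itself fits a Bradley–Terry model rather than running this exact update; the point is only how raw comparisons aggregate into a ranking.

```python
from collections import defaultdict

K = 32           # update step size (illustrative)
BASE = 1000.0    # starting rating for every model

def expected_score(r_a, r_b):
    """Probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def rate(votes):
    """votes: list of (model_a, model_b, winner) tuples; winner is 'a' or 'b'."""
    ratings = defaultdict(lambda: BASE)
    for a, b, winner in votes:
        e_a = expected_score(ratings[a], ratings[b])
        s_a = 1.0 if winner == "a" else 0.0
        ratings[a] += K * (s_a - e_a)            # winner gains rating
        ratings[b] += K * ((1 - s_a) - (1 - e_a))  # loser gives it up
    return dict(ratings)

# Invented votes, purely for illustration.
votes = [
    ("model-x", "model-y", "a"),
    ("model-y", "model-x", "b"),
    ("model-x", "model-z", "a"),
]
print(rate(votes))
```

Online Elo is order-dependent, so a real leaderboard fits all votes jointly; the sketch just shows why pairwise verdicts alone are enough to rank models without anyone ever assigning an absolute score.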

The Cost Problem

Human evaluation is expensive: skilled annotators, clear guidelines, quality control, and statistical significance require time and money. Evaluating a model across diverse tasks might need thousands of human judgments. This is why automated metrics exist — they're free and instant. The practical approach is to use automated metrics for rapid iteration during development and human evaluation for milestone decisions (release, A/B testing, safety audits).
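The statistical-significance point is worth making concrete: with a few dozen judgments, an apparent win rate is often indistinguishable from a coin flip. A small sketch, using only the Python standard library, that puts a 95% Wilson confidence interval around a pairwise win rate (the counts are invented):

```python
import math

def wilson_interval(wins, n, z=1.96):
    """95% Wilson score interval for a win rate from n pairwise judgments."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return (center - margin, center + margin)

# A 60% win rate looks good, but with 50 judgments the interval still
# includes 0.5, so we cannot conclude the new model is actually better.
low, high = wilson_interval(30, 50)
print(f"win rate 0.60 over 50 votes,  95% CI: [{low:.2f}, {high:.2f}]")  # ~[0.46, 0.72]

# The same win rate with 500 judgments clears 0.5 comfortably.
low, high = wilson_interval(300, 500)
print(f"win rate 0.60 over 500 votes, 95% CI: [{low:.2f}, {high:.2f}]")  # ~[0.56, 0.64]
```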

LLM-as-Judge

A middle ground: use a strong LLM to evaluate a weaker model's outputs. This is cheaper than human evaluation and often correlates well with human judgments. But it has known biases: LLM judges tend to prefer longer responses, more formatted responses, and responses that match their own style. Using multiple judge models and calibrating against human ratings helps, but LLM-as-judge should complement, not replace, human evaluation for important decisions.
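One cheap, widely used mitigation for position bias is to query the judge twice with the response order swapped and count only consistent verdicts. A minimal sketch, assuming a generic `call_llm(prompt) -> str` client for whichever judge model you use; the prompt wording and verdict parsing are illustrative, not a fixed protocol:

```python
from typing import Callable

JUDGE_PROMPT = """You are comparing two assistant responses to the same user request.

Request: {request}

Response A:
{a}

Response B:
{b}

Which response is more helpful, accurate, and safe? Answer with exactly "A" or "B"."""

def judge_pair(request: str, resp_1: str, resp_2: str,
               call_llm: Callable[[str], str]) -> str:
    """Return 'resp_1', 'resp_2', or 'tie' from two position-swapped calls."""
    first = call_llm(JUDGE_PROMPT.format(request=request, a=resp_1, b=resp_2))
    swapped = call_llm(JUDGE_PROMPT.format(request=request, a=resp_2, b=resp_1))
    # Count a win only if the verdict survives swapping the presentation
    # order; disagreement between the two calls is scored as a tie.
    if first.strip().upper().startswith("A") and swapped.strip().upper().startswith("B"):
        return "resp_1"
    if first.strip().upper().startswith("B") and swapped.strip().upper().startswith("A"):
        return "resp_2"
    return "tie"

# A dummy judge that always answers "A" is pure position bias,
# and the swap correctly scores it as a tie.
print(judge_pair("What is 2+2?", "4", "5", lambda prompt: "A"))  # -> 'tie'
```

Running the same loop over two or three different judge models, and spot-checking their verdicts against a small human-labeled sample, implements the calibration idea described above.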
