
Evaluation

Also known as: Evals, Model Evaluation
The methods used to measure how well an AI model performs. This goes far beyond benchmarks — it includes human evaluation (having people rate outputs), A/B testing (comparing models on real traffic), red teaming (adversarial testing), domain-specific testing (medical accuracy, code correctness), and community leaderboards (e.g. LMSYS's Chatbot Arena). Good evaluation is often harder than building the model itself.
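
As a concrete illustration, here is a minimal Python sketch of one common building block: computing a win rate from pairwise A/B judgments (whether from human raters or an automated judge). The function name and data format are hypothetical, chosen only to show the idea, not taken from any specific evaluation library.

```python
# Minimal sketch: scoring a pairwise A/B evaluation between two models.
# Assumes judgments have already been collected; names here are illustrative.
from collections import Counter

def win_rate(judgments):
    """judgments: list of 'A', 'B', or 'tie' verdicts from raters or a judge model.

    Returns model A's win rate with ties excluded.
    """
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return 0.5  # no decisive comparisons: treat the models as equal
    return counts["A"] / decided

# Example: 100 side-by-side comparisons collected for the same set of prompts.
judgments = ["A"] * 48 + ["B"] * 37 + ["tie"] * 15
print(f"Model A win rate (ties excluded): {win_rate(judgments):.1%}")
```

In practice a real evaluation also needs enough comparisons for the difference to be statistically meaningful, prompts that reflect real usage, and checks that raters (or the judge model) agree with each other.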

Why it matters

If you can't measure it, you can't improve it. But AI evaluation is uniquely hard because the tasks are open-ended and quality is subjective. Benchmarks get gamed, human eval is expensive, and the model that scores highest on paper often isn't the best in practice. Building good evals is a superpower.
