
Evaluation

Also known as: Evals, Model Evaluation
The methods used to measure how well an AI model performs. This goes far beyond benchmarks — it includes human evaluation (having people rate outputs), A/B testing (comparing models on real traffic), red teaming (adversarial testing), domain-specific testing (medical accuracy, code correctness), and community leaderboards (e.g. LMSYS's Chatbot Arena). Good evaluation is often harder than building the model itself.
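
As a concrete illustration, here is a minimal Python sketch of one common building block: computing a win rate from pairwise A/B judgments (whether from human raters or an automated judge). The function name and data format are hypothetical, chosen only to show the idea, not taken from any specific evaluation library.

```python
# Minimal sketch: scoring a pairwise A/B evaluation between two models.
# Assumes judgments have already been collected; names here are illustrative.
from collections import Counter

def win_rate(judgments):
    """judgments: list of 'A', 'B', or 'tie' verdicts from raters or a judge model.

    Returns model A's win rate with ties excluded.
    """
    counts = Counter(judgments)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        return 0.5  # no decisive comparisons: treat the models as equal
    return counts["A"] / decided

# Example: 100 side-by-side comparisons collected for the same set of prompts.
judgments = ["A"] * 48 + ["B"] * 37 + ["tie"] * 15
print(f"Model A win rate (ties excluded): {win_rate(judgments):.1%}")
```

In practice a real evaluation also needs enough comparisons for the difference to be statistically meaningful, prompts that reflect real usage, and checks that raters (or the judge model) agree with each other.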

Why it matters

If you can't measure it, you can't improve it. But AI evaluation is uniquely hard because the tasks are open-ended and quality is subjective. Benchmarks get gamed, human eval is expensive, and the model that scores highest on paper often isn't the best in practice. Building good evals is a superpower.
