
Precision & Recall

F1 Score, Confusion Matrix
Two complementary metrics for evaluating classifiers. Precision answers "of the items the model flagged as positive, how many actually are?" Recall answers "of all the actual positives, how many did the model find?" A spam filter with high precision rarely marks real email as spam. One with high recall catches most spam. The F1 score is their harmonic mean — a single number that balances both.

Why it matters

Accuracy alone is misleading. A model that never predicts "fraud" achieves 99.9% accuracy if only 0.1% of transactions are fraudulent — but it's completely useless. Precision and recall reveal the trade-offs: catching more fraud (higher recall) means more false alarms (lower precision), and vice versa. Every classification system in production is tuned based on this trade-off.
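The fraud example above can be checked in a few lines. This is a minimal sketch with hypothetical data (1 fraudulent transaction in 1,000, matching the 0.1% rate) and a "model" that always predicts the negative class:

```python
labels = [1] * 1 + [0] * 999   # 1 fraud case among 1,000 transactions
preds = [0] * 1000             # model never predicts fraud

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))  # frauds caught
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))  # frauds missed
recall = tp / (tp + fn)

print(accuracy)  # 0.999
print(recall)    # 0.0
```

Accuracy looks excellent while recall exposes that the model catches nothing.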

Deep Dive

The confusion matrix organizes predictions into four categories: True Positives (correctly flagged), False Positives (incorrectly flagged — Type I error), True Negatives (correctly passed), and False Negatives (missed — Type II error). Precision = TP / (TP + FP). Recall = TP / (TP + FN). F1 = 2 · (Precision · Recall) / (Precision + Recall).
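The three formulas above translate directly into code. A minimal sketch, with spam-filter counts that are hypothetical (80 spam caught, 20 real emails flagged, 40 spam missed):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=40)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.8 0.667 0.727
```

Note that F1 sits closer to the lower of the two metrics, which is exactly why the harmonic mean is used: a model can't inflate F1 by excelling at one metric while ignoring the other.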

The Trade-off in Practice

Most classifiers output a confidence score, and you choose a threshold above which to predict "positive." A low threshold catches more positives (high recall) but creates more false positives (low precision). A high threshold is more selective (high precision) but misses more positives (low recall). The optimal threshold depends on costs: in medical screening, missing a disease (false negative) is worse than a false alarm. In spam filtering, marking a real email as spam (false positive) is worse than letting spam through.
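The trade-off is easy to see by sweeping the threshold over a toy dataset. The (score, label) pairs below are hypothetical; as the threshold rises, precision climbs while recall falls:

```python
# Hypothetical (confidence score, true label) pairs from a classifier.
data = [(0.95, 1), (0.9, 1), (0.8, 0), (0.7, 1),
        (0.6, 0), (0.5, 1), (0.3, 0), (0.1, 0)]

for threshold in (0.4, 0.65, 0.85):
    # Predict "positive" whenever the score clears the threshold.
    preds = [(1 if score >= threshold else 0, label) for score, label in data]
    tp = sum(p == 1 and y == 1 for p, y in preds)
    fp = sum(p == 1 and y == 0 for p, y in preds)
    fn = sum(p == 0 and y == 1 for p, y in preds)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: precision={precision:.2f} recall={recall:.2f}")
```

On this data the low threshold gives perfect recall but only 0.67 precision, and the high threshold gives perfect precision but only 0.50 recall. Plotting these pairs over many thresholds yields the precision-recall curve.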

Beyond Binary

For multi-class problems, precision and recall are computed per class and then averaged. Macro-averaging takes the unweighted mean of the per-class scores, treating all classes equally. Micro-averaging pools the raw counts (TP, FP, FN) across classes before computing the metric, so frequent classes dominate. Weighted averaging takes the mean of the per-class scores weighted by each class's support (its number of true instances). The choice matters: if 90% of your data is class A, the micro-average will be dominated by class A's performance, potentially hiding poor performance on minority classes. In AI fairness work, per-class metrics are essential for ensuring the model works well for all groups.
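The gap between macro and micro averaging shows up clearly on an imbalanced example. A minimal sketch for recall, with hypothetical per-class counts for a three-class problem where the majority class A is easy and the minority classes are hard:

```python
# class: (true positives, false negatives) -- hypothetical counts
counts = {
    "A": (90, 0),  # majority class, perfect recall
    "B": (1, 4),   # minority class, poor recall
    "C": (1, 4),   # minority class, poor recall
}

# Macro: average the per-class recalls, one vote per class.
per_class = {c: tp / (tp + fn) for c, (tp, fn) in counts.items()}
macro = sum(per_class.values()) / len(per_class)

# Micro: pool the counts across classes, then compute recall once.
total_tp = sum(tp for tp, _ in counts.values())
total_fn = sum(fn for _, fn in counts.values())
micro = total_tp / (total_tp + total_fn)

print(per_class)        # {'A': 1.0, 'B': 0.2, 'C': 0.2}
print(round(macro, 3))  # 0.467
print(round(micro, 3))  # 0.92
```

The micro-average of 0.92 looks healthy because class A dominates the pooled counts, while the macro-average of 0.467 surfaces the failure on classes B and C.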
