Fundamentals

Precision & Recall

F1 Score, Confusion Matrix
Two complementary metrics for evaluating classifiers. Precision answers “of the items the model flagged as positive, how many actually are?” Recall answers “of all the real positives, how many did the model find?” An antispam filter with high precision rarely marks real email as spam. One with high recall catches most of the spam. The F1 score is their harmonic mean: a single number that balances both.

Why it matters

Accuracy alone is misleading. A model that never predicts “fraud” achieves 99.9% accuracy if only 0.1% of transactions are fraudulent, yet it is completely useless. Precision and recall reveal the trade-off: catching more fraud (higher recall) means more false alarms (lower precision), and vice versa. Every production classification system is tuned around this trade-off.
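
To make the trap concrete, here is a minimal Python sketch; the transaction counts and the degenerate “never fraud” model are invented for illustration:

```python
# Hypothetical data: 1,000 transactions, only 1 fraudulent (0.1%).
y_true = [1] + [0] * 999   # 1 = fraud, 0 = legitimate
y_pred = [0] * 1000        # degenerate model: never predicts fraud

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
recall = sum(t == p == 1 for t, p in zip(y_true, y_pred)) / sum(y_true)

print(f"accuracy: {accuracy:.1%}")  # 99.9% -- looks excellent
print(f"recall:   {recall:.1%}")    # 0.0%  -- catches no fraud at all
```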

Deep Dive

The confusion matrix organizes predictions into four categories: True Positives (correctly flagged), False Positives (incorrectly flagged — Type I error), True Negatives (correctly passed), and False Negatives (missed — Type II error). Precision = TP / (TP + FP). Recall = TP / (TP + FN). F1 = 2 · (Precision · Recall) / (Precision + Recall).
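
The formulas translate directly into code. A minimal Python sketch, assuming binary labels with 1 as the positive class; the example label lists are made up:

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall, and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up predictions: 3 TP, 1 FN, 1 FP, 3 TN.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```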

The Trade-off in Practice

Most classifiers output a confidence score, and you choose a threshold above which to predict "positive." A low threshold catches more positives (high recall) but creates more false positives (low precision). A high threshold is more selective (high precision) but misses more positives (low recall). The optimal threshold depends on costs: in medical screening, missing a disease (false negative) is worse than a false alarm. In spam filtering, marking a real email as spam (false positive) is worse than letting spam through.
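
To watch the trade-off move, sweep the threshold over a classifier's confidence scores and recompute both metrics at each cut-off. A self-contained sketch with invented scores and labels:

```python
# Hypothetical confidence scores and true labels for eight examples.
scores = [0.95, 0.90, 0.80, 0.65, 0.55, 0.40, 0.30, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.2, 0.5, 0.7):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(labels, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(labels, preds) if t == 1 and p == 0)
    print(f"threshold={threshold:.1f}  "
          f"precision={tp / (tp + fp):.2f}  recall={tp / (tp + fn):.2f}")

# threshold=0.2  precision=0.57  recall=1.00
# threshold=0.5  precision=0.60  recall=0.75
# threshold=0.7  precision=0.67  recall=0.50
```

Raising the threshold trades recall for precision; which direction to push is exactly the cost decision described above.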

Beyond Binary

For multi-class problems, precision and recall are computed per class and then averaged. Macro-averaging takes the unweighted mean of the per-class scores, treating all classes equally. Micro-averaging pools the TP, FP, and FN counts across all classes before computing the metric, so frequent classes dominate. Weighted averaging takes the per-class scores weighted by each class's support. The choice matters: if 90% of your data is class A, the micro average will be dominated by class A's performance, potentially hiding poor performance on minority classes. In AI fairness work, per-class metrics are essential for ensuring the model works well for all groups.
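
A sketch of how the averaging choice changes the picture, using per-class recall on an invented, heavily skewed three-class problem:

```python
from collections import Counter

# Hypothetical skewed data: 9 of 11 examples are class "A".
y_true = ["A"] * 9 + ["B", "C"]
y_pred = ["A"] * 11            # degenerate model: always predicts "A"

support = Counter(y_true)
classes = sorted(support)

# Per-class recall: of the true members of each class, how many were found?
recall = {c: sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c) / support[c]
          for c in classes}

macro = sum(recall.values()) / len(classes)                            # equal weight per class
weighted = sum(recall[c] * support[c] for c in classes) / len(y_true)  # weight by support
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)      # pool all counts

print(recall)           # {'A': 1.0, 'B': 0.0, 'C': 0.0}
print(macro)            # 0.33 -- exposes the failure on B and C
print(weighted, micro)  # 0.82, 0.82 -- both dominated by class A
```

Only the macro average flags the minority-class failure; micro and weighted averages look healthy here precisely because class A dominates.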
