Fundamentals

Precision & Recall

F1 Score, Confusion Matrix
Two complementary metrics for evaluating classifiers. Precision answers "of the items the model flagged as positive, how many actually are?" Recall answers "of all the actual positives, how many did the model find?" A spam filter with high precision rarely marks real emails as spam. One with high recall catches most of the spam. The F1 score is their harmonic mean: a single number that balances the two.

Why it matters

Accuracy alone is misleading. A model that never predicts "fraud" achieves 99.9% accuracy if only 0.1% of transactions are fraudulent, yet it is completely useless. Precision and recall expose the trade-off: catching more fraud (higher recall) means more false alarms (lower precision), and vice versa. Every production classification system is tuned around this trade-off.
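The accuracy trap is easy to see with a few lines of arithmetic. This is a toy illustration with invented numbers (1,000 transactions, 1 fraudulent):

```python
# Toy data: 1,000 transactions, only 1 fraudulent (0.1% positive rate).
labels = [1] + [0] * 999   # 1 = fraud, 0 = legitimate
preds = [0] * 1000         # a "model" that never predicts fraud

# Accuracy looks excellent...
accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
print(accuracy)  # 0.999

# ...but recall exposes that it catches zero fraud.
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
recall = tp / (tp + fn)
print(recall)    # 0.0
```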

Deep Dive

The confusion matrix organizes predictions into four categories: True Positives (correctly flagged), False Positives (incorrectly flagged — Type I error), True Negatives (correctly passed), and False Negatives (missed — Type II error). Precision = TP / (TP + FP). Recall = TP / (TP + FN). F1 = 2 · (Precision · Recall) / (Precision + Recall).
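The three formulas above can be sketched directly from raw predictions. The helper name and the toy labels below are invented for illustration:

```python
# Minimal sketch: precision, recall, and F1 from predictions.
# Convention: 1 = positive, 0 = negative.
def precision_recall_f1(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 2 TP, 1 FP, 2 FN, 3 TN
p, r, f1 = precision_recall_f1(y_true, y_pred)
print(p, r, f1)  # precision 2/3, recall 1/2, F1 4/7
```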

The Trade-off in Practice

Most classifiers output a confidence score, and you choose a threshold above which to predict "positive." A low threshold catches more positives (high recall) but creates more false positives (low precision). A high threshold is more selective (high precision) but misses more positives (low recall). The optimal threshold depends on costs: in medical screening, missing a disease (false negative) is worse than a false alarm. In spam filtering, marking a real email as spam (false positive) is worse than letting spam through.
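The threshold effect can be demonstrated by sweeping a cutoff over confidence scores. The scores and labels here are invented toy data:

```python
# Sketch of the precision/recall trade-off as the threshold moves.
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.2, 0.5, 0.85):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p and y for p, y in zip(preds, labels))
    fp = sum(p and not y for p, y in zip(preds, labels))
    fn = sum((not p) and y for p, y in zip(preds, labels))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    print(f"threshold={threshold}: "
          f"precision={precision:.2f} recall={recall:.2f}")
```

On this data, raising the threshold from 0.2 to 0.85 moves precision from 0.57 to 1.00 while recall falls from 1.00 to 0.50: the trade-off in miniature.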

Beyond Binary

For multi-class problems, precision and recall are computed per class and then averaged. Macro-averaging takes the unweighted mean of the per-class scores, treating all classes equally. Micro-averaging pools the TP, FP, and FN counts across classes before computing the metric, so frequent classes dominate. Weighted averaging takes the mean of per-class scores weighted by each class's support. The choice matters: if 90% of your data is class A, the micro-average will be dominated by class A performance, potentially hiding poor performance on minority classes. In AI fairness work, per-class metrics are essential for ensuring the model works well for all groups.
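The gap between macro and micro averaging shows up immediately on imbalanced data. A minimal sketch with an invented two-class toy set (9 samples of class "a", 1 of class "b"):

```python
# Toy imbalanced data: perfect on the majority class,
# complete failure on the single minority sample.
y_true = ["a"] * 9 + ["b"]
y_pred = ["a"] * 9 + ["a"]

def class_recall(cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    support = sum(t == cls for t in y_true)
    return tp / support

# Macro: unweighted mean of per-class recall -> penalizes the miss on "b".
macro = (class_recall("a") + class_recall("b")) / 2
# Micro: pooled over all instances -> dominated by the majority class.
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(macro, micro)  # 0.5 0.9
```

The micro-average (0.9) looks healthy while the macro-average (0.5) reveals that the model never gets class "b" right.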
