Core Concepts

Precision & Recall

F1 Score, Confusion Matrix
Two complementary metrics for evaluating classifiers. Precision answers: "Of the items the model flagged as positive, how many actually are positive?" Recall answers: "Of all the actual positives, how many did the model find?" A spam filter with high precision rarely marks real email as spam. One with high recall catches most spam. The F1 score is their harmonic mean, a single number that balances both.

Why It Matters

Accuracy alone is misleading. A model that never predicts "fraud" achieves 99.9% accuracy if only 0.1% of transactions are fraudulent, yet it is completely useless. Precision and recall reveal the trade-off: catching more fraud (higher recall) means more false alarms (lower precision), and vice versa. Every classification system in production is tuned around this trade-off.
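
To see the accuracy paradox concretely, here is a minimal sketch; the 1,000-transaction dataset and the always-negative model are hypothetical illustrations, not from any real system:

```python
# Hypothetical imbalanced dataset: 999 legitimate transactions, 1 fraud.
y_true = [0] * 999 + [1]    # 0 = legitimate, 1 = fraud
y_pred = [0] * 1000         # a model that never predicts "fraud"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

print(f"accuracy: {accuracy:.1%}")        # 99.9% -- looks great
print(f"recall:   {tp / (tp + fn):.1%}")  # 0.0%  -- catches no fraud at all
```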

Deep Dive

The confusion matrix organizes predictions into four categories: True Positives (correctly flagged), False Positives (incorrectly flagged — Type I error), True Negatives (correctly passed), and False Negatives (missed — Type II error). Precision = TP / (TP + FP). Recall = TP / (TP + FN). F1 = 2 · (Precision · Recall) / (Precision + Recall).
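
As a minimal sketch, the four counts and the three metrics can be computed directly from paired labels. The function name and the example labels below are illustrative, not from any particular library:

```python
def confusion_counts(y_true, y_pred):
    """Tally the four confusion-matrix cells for binary labels (1 = positive)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))  # true positives
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # false positives (Type I)
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))  # true negatives
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # false negatives (Type II)
    return tp, fp, tn, fn

tp, fp, tn, fn = confusion_counts([1, 0, 1, 1, 0, 0], [1, 1, 1, 0, 0, 0])
precision = tp / (tp + fp)                               # TP / (TP + FP)
recall    = tp / (tp + fn)                               # TP / (TP + FN)
f1 = 2 * precision * recall / (precision + recall)       # harmonic mean
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")          # 0.667 0.667 0.667
```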

The Trade-off in Practice

Most classifiers output a confidence score, and you choose a threshold above which to predict "positive." A low threshold catches more positives (high recall) but creates more false positives (low precision). A high threshold is more selective (high precision) but misses more positives (low recall). The optimal threshold depends on costs: in medical screening, missing a disease (false negative) is worse than a false alarm. In spam filtering, marking a real email as spam (false positive) is worse than letting spam through.
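
A minimal sketch of this threshold sweep, with made-up confidence scores and labels: as the threshold rises, precision climbs while recall falls.

```python
# Hypothetical confidence scores from a classifier, with the true labels.
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1,    1,    0,    1,    0,    1,    0,    0]

for threshold in (0.25, 0.50, 0.75):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(labels, preds))
    fp = sum(t == 0 and p == 1 for t, p in zip(labels, preds))
    fn = sum(t == 1 and p == 0 for t, p in zip(labels, preds))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  recall={recall:.2f}")

# threshold=0.25  precision=0.67  recall=1.00   (low bar: catches all, noisy)
# threshold=0.50  precision=0.75  recall=0.75
# threshold=0.75  precision=1.00  recall=0.50   (high bar: clean, misses half)
```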

Beyond Binary

For multi-class problems, precision and recall are computed per class and then averaged. Macro-averaging treats all classes equally. Micro-averaging pools the TP/FP/FN counts across all classes, so every instance counts equally and frequent classes dominate. Weighted averaging averages the per-class scores weighted by class support. The choice matters: if 90% of your data is class A, the micro-average will be dominated by class A performance, potentially hiding poor performance on minority classes. In AI fairness work, per-class metrics are essential for ensuring the model works well for all groups.
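
A minimal sketch contrasting the three averaging schemes on a hypothetical 3-class label set, with class counts chosen so class "a" dominates:

```python
from collections import Counter

# Hypothetical multi-class predictions; class "a" dominates the data.
y_true = ["a"] * 9 + ["b", "c"]
y_pred = ["a"] * 9 + ["a", "a"]   # perfect on "a", wrong on "b" and "c"

classes = sorted(set(y_true))
support = Counter(y_true)
per_class_recall = {}
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
    per_class_recall[c] = tp / (tp + fn)

# Macro: every class counts equally, exposing minority-class failures.
macro = sum(per_class_recall.values()) / len(classes)
# Weighted: classes count in proportion to their support.
weighted = sum(per_class_recall[c] * support[c] for c in classes) / len(y_true)
# Micro: pool counts globally (for single-label recall this equals accuracy).
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"macro={macro:.2f}  weighted={weighted:.2f}  micro={micro:.2f}")
# macro=0.33  weighted=0.82  micro=0.82  -- only macro reveals the failures
```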
