Google Research scientists Flip Korn and Chris Welty have developed a framework that exposes a fundamental flaw in how AI benchmarks are built: most use too few human raters per evaluation item. Their research on the "(N,K) trade-off" — balancing the number of evaluation items (N) against the number of raters per item (K) — found that the industry standard of 1-5 raters often fails to capture natural human disagreement, making benchmarks less reproducible than researchers assume.

This matters because AI evaluation has historically favored breadth over depth, asking many people to rate different items rather than having multiple people rate the same items. The problem becomes acute in subjective tasks like toxicity detection, where human perspectives naturally vary. When benchmarks ignore this disagreement by defaulting to plurality voting, they create a false sense of ground truth that doesn't reflect real-world complexity. Two toxicity examples might have identical plurality scores but vastly different confidence levels among raters.

What's striking is how little research has examined this issue despite its impact on reproducibility — the ability of different teams to run the same evaluation and get consistent results. The researchers developed a simulator based on real toxicity and hate speech datasets to stress-test different rating configurations, providing what they call a "roadmap" for more reliable benchmarks.
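The idea behind such a simulation can be sketched as follows. This is not the authors' simulator — it assumes a made-up pool of items, each with a latent probability that a random rater marks it toxic — but it shows why two independent runs of the "same" benchmark drift apart when K is small:

```python
import random

random.seed(0)

# Hypothetical item pool: each item has a latent probability that a
# randomly drawn rater labels it toxic (modeling natural disagreement).
N_ITEMS = 200
items = [random.betavariate(2, 2) for _ in range(N_ITEMS)]

def run_benchmark(k):
    """Draw k rater votes per item; return the fraction of items
    labeled toxic by plurality vote."""
    toxic = 0
    for p in items:
        votes = sum(random.random() < p for _ in range(k))
        if votes > k / 2:
            toxic += 1
    return toxic / N_ITEMS

# Two independent "teams" evaluate the same items with fresh raters.
# With few raters per item, their scores diverge noticeably.
for k in (1, 3, 15):
    run1, run2 = run_benchmark(k), run_benchmark(k)
    print(f"K={k:2d}: run1={run1:.3f} run2={run2:.3f} "
          f"gap={abs(run1 - run2):.3f}")
```

Sweeping K this way (for a fixed annotation budget, larger K means smaller N) is the essence of the (N,K) trade-off the paper studies.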

For developers building AI systems, this research suggests you should be skeptical of benchmarks that don't report inter-rater agreement or use minimal human validation. When evaluating models on subjective tasks, consider the confidence intervals around benchmark scores, not just the headline numbers. The trade-off between annotation budget and reliability isn't just an academic concern — it directly affects whether your model comparisons mean anything in production.
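One practical way to look past the headline number is a bootstrap confidence interval over per-item scores. This is a generic sketch with fabricated per-item results, not a procedure from the paper:

```python
import random

random.seed(42)

# Hypothetical per-item outcomes (1 = model judged correct, 0 = not).
scores = [1 if random.random() < 0.8 else 0 for _ in range(500)]

def bootstrap_ci(scores, n_boot=2000, alpha=0.05):
    """Percentile bootstrap CI for the mean benchmark score."""
    n = len(scores)
    means = sorted(
        sum(scores[random.randrange(n)] for _ in range(n)) / n
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

point = sum(scores) / len(scores)
lo, hi = bootstrap_ci(scores)
print(f"accuracy {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

If two models' intervals overlap heavily, the "winner" on the leaderboard may just be rating noise — exactly the kind of false confidence this research warns about.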