Building better AI benchmarks: How many raters are enough?

15 April 2026 07:25 AM IST

Google Research explores the trade-off between the number of items and the number of human raters per item, with the aim of making AI benchmarks more reproducible and capturing the nuance of human disagreement.

Figure (ForestVTree-3a-Results): Line graph showing how the p-value decreases as the sample size (N×K) and the value of K increase.

Figure (ForestVTree-1b-Annotations): Infographic showing how human disagreement on "Toxic" vs. "Not Toxic" labels is often collapsed into a single plurality label.
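
As a rough illustration of the plurality-label collapse described for the annotations infographic, the minimal sketch below compares a single collapsed label with the full distribution of ratings for one item. The function names and the five example ratings are hypothetical and chosen only for illustration; they are not taken from the original article or its data.

```python
from collections import Counter

def plurality_label(ratings):
    """Collapse a list of rater labels into the single most common label.

    Ties go to the label that appears first in the list.
    """
    return Counter(ratings).most_common(1)[0][0]

def label_distribution(ratings):
    """Keep the disagreement: return the fraction of raters per label."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}

# Hypothetical ratings from K = 5 raters for one item (illustrative only).
ratings = ["Toxic", "Toxic", "Not Toxic", "Toxic", "Not Toxic"]

collapsed = plurality_label(ratings)
distribution = label_distribution(ratings)

print(collapsed)
print(distribution)
```

In this made-up case the collapsed label reports only "Toxic", while the distribution shows that two of the five raters disagreed. That lost detail is the kind of nuance the summary refers to.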

Disclaimer: This content has been automatically aggregated from GOOGLE AI for informational purposes. To read the original article, please visit GOOGLE AI.