Building better AI benchmarks: How many raters are enough?

15 April 2026 07:25 AM IST

Google Research explores the trade-off between the number of items in a benchmark and the number of human raters per item, with the aim of making AI benchmarks more reproducible while still capturing the nuance of human disagreement.

Figure: Line graph showing how the p-value decreases as the sample size (N × K) and the number of raters per item (K) increase.

Figure: Infographic showing how human disagreement on "Toxic" vs. "Not Toxic" labels is often collapsed into a single plurality label.
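
To make the collapse described in the infographic concrete, here is a minimal Python sketch. It is not taken from the original article: the function names `plurality_label` and `label_distribution` and the example ratings are hypothetical, chosen only to illustrate how two items with very different levels of rater agreement can end up with the same plurality label.

```python
from collections import Counter


def plurality_label(ratings):
    """Return the most frequent label among the ratings for one item.

    Ties resolve to the label that appears first in the list.
    """
    return Counter(ratings).most_common(1)[0][0]


def label_distribution(ratings):
    """Return the fraction of raters who chose each label for one item."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}


# Hypothetical ratings from K = 5 raters for two items (illustration only).
unanimous = ["Toxic"] * 5
split = ["Toxic", "Toxic", "Toxic", "Not Toxic", "Not Toxic"]

for name, ratings in [("unanimous", unanimous), ("split", split)]:
    print(name, plurality_label(ratings), label_distribution(ratings))
```

Both items receive the plurality label "Toxic", but only the per-item distribution shows that two of the five raters disagreed on the second item. That disagreement is the information a single collapsed label discards.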

Disclaimer: This content has been automatically aggregated from GOOGLE AI for informational purposes. To read the original article, please visit GOOGLE AI.