Building better AI benchmarks: How many raters are enough?
Algorithms & Theory
Google Research explores the trade-off between the number of items and the number of human raters per item, with the aim of improving AI benchmark reproducibility and capturing the nuance of human disagreement.
Figure: Line graph showing how the p-value decreases as the sample size (N×K) and the value of K increase.
Figure: Infographic showing how human disagreement on "Toxic" vs. "Not Toxic" labels is often collapsed into a single plurality label.
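To make the plurality-label collapse concrete, here is a minimal Python sketch. It is an illustration only, not the method from the original article: the item names, the ratings, and the helper functions `plurality_label` and `label_distribution` are all hypothetical.

```python
from collections import Counter


def plurality_label(ratings):
    """Return the most common label among one item's ratings."""
    return Counter(ratings).most_common(1)[0][0]


def label_distribution(ratings):
    """Return the fraction of raters who chose each label."""
    counts = Counter(ratings)
    total = len(ratings)
    return {label: count / total for label, count in counts.items()}


# Hypothetical ratings from five raters for two items (made-up data).
items = {
    "comment_a": ["Toxic", "Toxic", "Toxic", "Toxic", "Toxic"],
    "comment_b": ["Toxic", "Toxic", "Toxic", "Not Toxic", "Not Toxic"],
}

for name, ratings in items.items():
    # Both items collapse to the same plurality label, but the
    # distributions show that raters agreed on one and split on the other.
    print(name, plurality_label(ratings), label_distribution(ratings))
```

In this toy data both items receive the plurality label "Toxic", even though the raters were unanimous on the first item and split three to two on the second. Keeping the full label distribution preserves that difference.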

