What question did this study set out to answer?

This research investigates the nature of disagreement in human evaluations of AI-generated content, challenging assumptions of objective quality.

May 5, 2026 | Open Access

Why Do Humans Disagree in AI Evaluation? Disagreement as Structure, Not Error

Key Points

Analyzed 1,500 sentence-level evaluations from 300 GPT-4o-generated text samples.
Measured divergence among five independent raters in their grounding assessments.
Identified patterns in inter-rater variance related to structural complexity of outputs.
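The five raters diverged by 22.1 percentage points in their grounding assessments.
The disagreement followed a strict monotonic pattern, R0 < R1 < R2, across all raters and conditions, and inter-rater variance narrowed as structural depth increased. The authors term this Structured Divergence.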
Disagreement reflects different interpretive standards rather than random error.

Abstract

Human evaluation lies at the center of how AI systems are built and trusted. Benchmarks are constructed from human labels; reinforcement learning from human feedback (RLHF) treats aggregated human judgment as a proxy for quality; safety assessments rely on human raters to identify harmful outputs. Underlying all of these systems is a shared implicit assumption: that human judgment, when consistently applied, approximates objective quality.

This study challenges that assumption, not through theoretical argument, but through data. Analyzing 1,500 sentence-level evaluations drawn from 300 GPT-4o-generated text samples, we find that five independent raters diverged by 22.1 percentage points in their grounding assessments. Yet the disagreement was not random. Across all raters and all conditions, a strict monotonic pattern held without exception: R0 < R1 < R2. As structural depth increased, inter-rater variance narrowed.

We term this phenomenon Structured Divergence. Disagreement is not noise. It is the systematic expression of different interpretive standards encountering structurally graded output. The conclusion is straightforward: human evaluation is not broken. It has been misread.
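For readers who want to see what a figure such as "diverged by 22.1 percentage points" could mean operationally, the following is a minimal sketch, not the authors' method. It assumes that each rater assigns a binary grounded/not-grounded label per sentence, that divergence is the gap between the highest and lowest per-rater grounding rate, and that R0, R1, and R2 are per-level values that can be compared in order; the abstract does not define any of these, so the function names and the toy labels are hypothetical.

```python
from typing import Dict, List


def grounding_rate(labels: List[int]) -> float:
    """Percentage of sentences a rater labeled as grounded (1) rather than not (0)."""
    return 100.0 * sum(labels) / len(labels)


def rater_spread(labels_by_rater: Dict[str, List[int]]) -> float:
    """Gap, in percentage points, between the most and least lenient rater.

    Assumption: 'divergence' is read here as max minus min of per-rater rates.
    """
    rates = [grounding_rate(labels) for labels in labels_by_rater.values()]
    return max(rates) - min(rates)


def is_strictly_increasing(values: List[float]) -> bool:
    """True when each value is strictly smaller than the next (e.g. R0 < R1 < R2)."""
    return all(a < b for a, b in zip(values, values[1:]))


# Toy labels, invented purely for illustration (not data from the study).
toy = {
    "rater_a": [1, 1, 1, 0],  # 75.0% grounded
    "rater_b": [1, 0, 0, 0],  # 25.0% grounded
    "rater_c": [1, 1, 0, 0],  # 50.0% grounded
}

spread = rater_spread(toy)  # 75.0 - 25.0 = 50.0 on this toy data
```

Under these assumptions, checking the reported ordering for one rater would amount to calling `is_strictly_increasing([r0, r1, r2])` on that rater's three per-level values and repeating the check for every rater and condition.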
