Human evaluation lies at the center of how AI systems are built and trusted. Benchmarks areconstructed from human labels; reinforcement learning from human feedback (RLHF) treats aggregatedhuman judgment as a proxy for quality; safety assessments rely on human raters to identify harmful outputs.Underlying all of these systems is a shared implicit assumption: that human judgment, when consistentlyapplied, approximates objective quality.This study challenges that assumption — not through theoretical argument, but through data.Analyzing 1,500 sentence-level evaluations drawn from 300 GPT-4o-generated text samples, we findthat five independent raters diverged by 22.1 percentage points in their grounding assessments. Yet thedisagreement was not random. Across all raters and all conditions, a strict monotonic pattern held withoutexception: R0 < R1 < R2. As structural depth increased, inter-rater variance narrowed.We term this phenomenon Structured Divergence. Disagreement is not noise. It is the systematicexpression of different interpretive standards encountering structurally graded output. The conclusion isstraightforward: human evaluation is not broken. It has been misread.
DAEDO JUN (Sun,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: