What question did this study set out to answer?

The aim is to redesign annotation pipelines to preserve expert judgments and the reasoning behind them, rather than collapsing them into consensus labels.

April 18, 2026 · Open Access

The Judgment Paradox: Disagreement Valuation, Annotation Pipelines, and the Case for Preservation

Key Points

The aim is to redesign annotation pipelines to preserve expert judgments and the reasoning behind them, rather than collapsing them into consensus labels.
Proposes a Rich Annotation Object (RAO) to replace binary preference labels.
The RAO is a structured data format that captures full judgment distributions and reasoning metadata.
Identifies reinforcement learning (RL) optimisation as structurally hostile to calibrated uncertainty on contested items.
Recommends supervised fine-tuning (SFT) as the primary integration path.
Introduces a family of approaches called Reinforcement Learning from Human Disagreement (RLHD).
Derives seven testable predictions, each with a named falsifier, from established cognitive science findings.
Proposes a pilot study design; the pipeline has not yet been empirically tested.

Abstract

Current annotation pipelines for reinforcement learning from human feedback (RLHF) and related training methods systematically destroy valuable information by collapsing expert disagreement into single consensus labels. But the information loss extends beyond disagreement: even when experts agree, the reasoning behind their agreement is discarded, regardless of whether they converged from different frameworks or applied the same one. This paper proposes a redesigned annotation pipeline that preserves raw annotator judgments, captures reasoning metadata, structures the full distribution of expert judgment as a training signal, and returns professional value to the annotators themselves. The concrete deliverable is the Rich Annotation Object (RAO): a structured data format replacing binary preference labels with full judgment distributions, per-annotator reasoning, cross-review matrices, and disagreement classification. The pipeline is not a disagreement-preservation tool. It is a signal enrichment tool across the entire distribution of expert judgment. We call this family of approaches RLHD (Reinforcement Learning from Human Disagreement). The paper identifies RL optimisation as structurally hostile to calibrated uncertainty on contested items and recommends supervised fine-tuning (SFT, training the model directly on calibrated demonstration responses) as the primary integration path. RL-based approaches are developed as alternatives. Direct Preference Optimisation (DPO, a method that learns from preference pairs without a separate reward model) is identified as structurally limited for highly contested items. The RAO supports multiple downstream applications beyond training, consolidated in §4. Seven testable predictions with named falsifiers are derived from established cognitive science findings. The pipeline is not empirically tested; a pilot study design is proposed. This is a collaboration invitation.
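To make the RAO concrete, the following is a minimal illustrative sketch in Python of how such an object might be laid out. It is not the paper's schema: every class name, field name, and the example data are assumptions chosen to mirror the four components named in the abstract (judgment distribution, per-annotator reasoning, cross-review matrix, disagreement classification). The `consensus_label` method is included only to show what a conventional pipeline would keep.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AnnotatorJudgment:
    """One expert's raw judgment (all field names are hypothetical)."""
    annotator_id: str
    label: str      # e.g. "A" or "B" for a preference pair
    reasoning: str  # free-text rationale behind the label


@dataclass
class RichAnnotationObject:
    """Illustrative RAO layout; an assumption, not the paper's format."""
    item_id: str
    judgments: list[AnnotatorJudgment]
    # cross_review[(reviewer_id, author_id)] = reviewer's score for the
    # author's reasoning (the scoring scale is assumed here).
    cross_review: dict[tuple[str, str], int] = field(default_factory=dict)
    # Assumed free-form tag, e.g. "framework" or "factual".
    disagreement_class: Optional[str] = None

    def distribution(self) -> dict[str, float]:
        """Share of annotators choosing each label."""
        counts = Counter(j.label for j in self.judgments)
        total = sum(counts.values())
        if total == 0:
            return {}
        return {label: n / total for label, n in counts.items()}

    def consensus_label(self) -> str:
        """Majority vote: the single label a collapsing pipeline keeps."""
        if not self.judgments:
            raise ValueError("no judgments recorded")
        return Counter(j.label for j in self.judgments).most_common(1)[0][0]


# Made-up example: five annotators split 3 to 2 on a preference pair.
example = RichAnnotationObject(
    item_id="item-001",
    judgments=[
        AnnotatorJudgment("ann1", "A", "Cites the stronger evidence."),
        AnnotatorJudgment("ann2", "A", "More cautious about risk."),
        AnnotatorJudgment("ann3", "A", "Matches professional guidance."),
        AnnotatorJudgment("ann4", "B", "Answers the question more directly."),
        AnnotatorJudgment("ann5", "B", "Response A over-hedges."),
    ],
    cross_review={("ann4", "ann1"): 4, ("ann1", "ann4"): 2},
    disagreement_class="framework",
)

print(example.consensus_label())  # the collapsed label
print(example.distribution())     # the preserved distribution
```

In this sketch the consensus label retains only the majority choice, while the distribution, the per-annotator reasoning, and the cross-review scores remain available on the same object for downstream use.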
