Current annotation pipelines for reinforcement learning from human feedback (RLHF) and related training methods systematically destroy valuable information by collapsing expert disagreement into single consensus labels. The loss extends beyond disagreement: even when experts agree, the reasoning behind their agreement is discarded, whether they converged from different frameworks or applied the same one. This paper proposes a redesigned annotation pipeline that preserves raw annotator judgments, captures reasoning metadata, structures the full distribution of expert judgment as a training signal, and returns professional value to the annotators themselves. The concrete deliverable is the Rich Annotation Object (RAO): a structured data format that replaces binary preference labels with full judgment distributions, per-annotator reasoning, cross-review matrices, and disagreement classification. The pipeline is not only a disagreement-preservation tool; it enriches the signal across the entire distribution of expert judgment, agreement included. We call this family of approaches RLHD (Reinforcement Learning from Human Disagreement). The paper identifies reinforcement learning (RL) optimisation as structurally hostile to calibrated uncertainty on contested items and recommends supervised fine-tuning (SFT, training the model directly on calibrated demonstration responses) as the primary integration path. RL-based approaches are developed as alternatives, and Direct Preference Optimisation (DPO, a method that learns from preference pairs without a separate reward model) is identified as structurally limited for highly contested items. The RAO also supports multiple downstream applications beyond training, consolidated in §4. Seven testable predictions with named falsifiers are derived from established cognitive science findings. The pipeline has not been empirically tested; a pilot study design is proposed. This paper is a collaboration invitation.
Ivan "HiP" Phan studied this question.
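To make the Rich Annotation Object described in the abstract more concrete, the following is a minimal Python sketch of what such a record might look like. It is an illustration under stated assumptions, not the paper's schema: every class name, field name, and method here (`AnnotatorJudgment`, `RichAnnotationObject`, `framework`, `cross_review`, `disagreement_class`, `entropy_bits`, `consensus_label`) is hypothetical, and Shannon entropy is used only as one possible summary of how contested an item is.

```python
from collections import Counter
from dataclasses import dataclass, field
from math import log2
from typing import Dict, List


@dataclass
class AnnotatorJudgment:
    """One expert's raw judgment on one item (hypothetical fields)."""
    annotator_id: str
    label: str           # e.g. "A" or "B" in a pairwise comparison
    reasoning: str       # free-text rationale behind the label
    framework: str = ""  # assumed tag for the reasoning framework applied


@dataclass
class RichAnnotationObject:
    """Illustrative RAO: keeps every judgment instead of one consensus label."""
    item_id: str
    judgments: List[AnnotatorJudgment]
    # cross_review[reviewer_id][author_id] = reviewer's rating of the
    # author's reasoning (the rating scale is an assumption of this sketch)
    cross_review: Dict[str, Dict[str, float]] = field(default_factory=dict)
    disagreement_class: str = "unclassified"

    def distribution(self) -> Dict[str, float]:
        """Empirical distribution over labels across all annotators."""
        counts = Counter(j.label for j in self.judgments)
        total = sum(counts.values())
        return {label: n / total for label, n in counts.items()}

    def entropy_bits(self) -> float:
        """Shannon entropy of the label distribution, in bits."""
        return sum(p * log2(1 / p) for p in self.distribution().values())

    def consensus_label(self) -> str:
        """The single label a conventional pipeline would keep.

        Assumes at least one judgment; ties resolve to the label seen first.
        """
        return Counter(j.label for j in self.judgments).most_common(1)[0][0]


# Usage: three experts, two of whom agree for different reasons.
rao = RichAnnotationObject(
    item_id="item-001",
    judgments=[
        AnnotatorJudgment("ann-1", "A", "Safer wording", framework="risk"),
        AnnotatorJudgment("ann-2", "A", "More accurate", framework="accuracy"),
        AnnotatorJudgment("ann-3", "B", "A omits a caveat", framework="risk"),
    ],
)
print(rao.consensus_label())  # the only thing a binary label would retain
print(rao.distribution())     # the full judgment distribution
print(rao.entropy_bits())     # one possible measure of contestedness
```

In this sketch the consensus label hides two things the record retains: the dissenting judgment, and the fact that the two agreeing annotators reached the same label from different frameworks.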