What question did this study set out to answer?

The aim is to identify potential biases in preference data caused by the rater's emotional state during annotation.

June 3, 2026 · Open Access

Rater State Bias in RLHF Preference Data: An Audit Framework

Key Points

Aims to identify biases in preference data caused by the rater's emotional state during annotation.
Proposes an audit framework to study the influence of rater state on preference labeling.
Explores whether rater preference shifts are correlated across annotators under shared stressful conditions.
Outlines characteristics that differentiate rater state bias from random label noise.
Suggests that rater preferences may shift under distress, impacting data quality.
Correlations in preferences across annotators may indicate systematic bias from shared conditions.
Recommends further investigation into rater state shifts as a potential source of bias.

Abstract

We identify a structured confound in Reinforcement Learning from Human Feedback (RLHF). Pairwise preference labels are intended to reflect the compared outputs. They may also reflect the rater's state during annotation. Under sustained stressful or distressing conditions, a rater's preferences may shift over time, so that preference data encodes rater state alongside judgments about response quality. We argue that, if present, such shifts would differ from random label noise. They could be correlated across annotators under shared conditions, and would not be guaranteed to cancel under aggregation. We propose rater state shift as a plausible, testable source of bias, and outline an audit framework for studying it. We do not infer the training history of any specific deployed model.
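The claim that correlated shifts need not cancel under aggregation can be illustrated with a toy simulation. The sketch below is not the paper's audit framework; the function name, the flip probability of 0.3, the panel of 9 raters, and the all-or-nothing "shared condition" model are assumptions chosen only for illustration. It compares majority-vote accuracy when label flips are independent across raters with the case where a single shared draw flips every rater on an item at once.

```python
import random


def majority_vote_accuracy(n_items, n_raters, flip_prob, shared, seed=0):
    """Fraction of items where the majority label matches the true preference.

    shared=False: each rater flips independently with probability flip_prob
                  (a stand-in for random label noise).
    shared=True:  one draw per item flips all raters together
                  (a crude, hypothetical stand-in for a shared condition).
    """
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        if shared:
            # A single draw decides the whole panel for this item.
            flipped = rng.random() < flip_prob
            votes_correct = 0 if flipped else n_raters
        else:
            # Each rater is right with probability 1 - flip_prob.
            votes_correct = sum(
                rng.random() >= flip_prob for _ in range(n_raters)
            )
        if votes_correct * 2 > n_raters:
            correct += 1
    return correct / n_items


independent = majority_vote_accuracy(2000, 9, 0.3, shared=False)
correlated = majority_vote_accuracy(2000, 9, 0.3, shared=True)
print(f"independent noise: {independent:.3f}")
print(f"shared flips:      {correlated:.3f}")
```

Under these assumptions, majority voting recovers the true preference on about 90% of items when flips are independent, but only about 70% when flips are shared, which is no better than a single rater. Real rater state effects would presumably be partial and graded rather than all-or-nothing, so this only shows the direction of the effect, not its size.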

Cite This Study

Kopteva et al. studied this question.

synapsesocial.com/papers/6a1fc730dee9eb8c0dce80d7 https://doi.org/10.5281/zenodo.20499998
