We identify a structured confound in Reinforcement Learning from Human Feedback (RLHF). Pairwise preference labels are intended to reflect the relative quality of the compared outputs, but they may also reflect the rater's state during annotation. Under sustained stressful or distressing conditions, a rater's preferences may shift over time, so that preference data encodes rater state alongside judgments about response quality. We argue that, if present, such shifts would differ from random label noise: they could be correlated across annotators working under shared conditions, and so would not be guaranteed to cancel under aggregation. We propose rater state shift as a plausible, testable source of bias and outline an audit framework for studying it. We do not infer the training history of any specific deployed model.
Kopteva et al. (Mon,) studied this question.