What type of study is this?

This is an experimental study.

September 23, 2025 · Open Access

Noise-Aware Direct Preference Optimization for RLAIF

Key Points

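Under heavy synthetic noise (30% flipped labels), nrDPO-gated improves preference accuracy by 3.8% over vanilla DPO.
In a realistic RLAIF setting, nrDPO-gated reaches approximately 60% alignment accuracy on a 5k relabeled set, versus approximately 49–50% for vanilla DPO.
AI teachers flipped about 50% of the original human preferences, which degrades the performance of standard direct preference optimization (DPO).
nrDPO reweights each preference pair using a margin-confidence term from a frozen reference policy, a context-stability term, and a length correction; nrDPO-gated additionally filters out low-confidence pairs.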

Abstract

Reinforcement Learning from Human Feedback (RLHF) produces powerful instruction-following models but relies on a preference-labeling process that is both costly and slow. An effective alternative, Reinforcement Learning from AI Feedback (RLAIF), uses large language models as teachers for relabeling; however, this introduces substantial label noise. In our setting, we found that AI teachers flipped approximately 50% of the original human preferences on the dataset, a condition that degrades the performance of standard direct preference optimization (DPO). We propose noise-robust DPO (nrDPO) and nrDPO-gated, two drop-in variants that make DPO resilient to noisy preferences. nrDPO reweights each pair by (i) a margin-confidence term from a frozen reference policy (base or SFT), (ii) a context-stability term that penalizes preferences that change under truncated histories, and (iii) a length correction to curb verbosity bias. nrDPO-gated further filters low-confidence pairs via a simple threshold on the reference margin. On a dataset with heavy synthetic noise (30% flips), nrDPO-gated improves the preference accuracy by +3.8% over vanilla DPO; in a realistic RLAIF setting, nrDPO-gated is the only configuration that recovers competitive alignment, reaching ≈60% on a 5k relabeled set (vs. ≈49–50% for vanilla DPO) and approaching RLHF baselines.
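
The abstract names the three reweighting terms and the gate but does not give their functional forms. The sketch below is one possible reading, not the authors' implementation. It assumes the weight is a product of a sigmoid of the reference margin, a fixed penalty when the sign of the margin changes under a truncated history, and an exponential penalty on how much longer the chosen response is. All function names and the parameters `tau`, `flip_penalty`, `length_coef`, and `gate` are hypothetical. Only the per-pair DPO loss follows the standard formulation.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    # Standard DPO loss for one pair, given summed log-probabilities of the
    # chosen (w) and rejected (l) responses under the policy and the reference.
    return -log_sigmoid(beta * ((pol_w - ref_w) - (pol_l - ref_l)))


def nrdpo_weight(ref_margin, ref_margin_truncated, len_chosen, len_rejected,
                 tau=1.0, flip_penalty=0.5, length_coef=0.01, gate=None):
    # Hypothetical per-pair weight; the functional forms are assumptions.
    # Gating (assumed form): drop pairs whose reference margin is near zero.
    if gate is not None and abs(ref_margin) < gate:
        return 0.0
    # (i) Margin confidence: higher when the frozen reference agrees with the label.
    confidence = math.exp(log_sigmoid(ref_margin / tau))
    # (ii) Context stability: penalize pairs whose preferred side changes
    # when the reference margin is recomputed on a truncated history.
    stable = (ref_margin >= 0) == (ref_margin_truncated >= 0)
    stability = 1.0 if stable else flip_penalty
    # (iii) Length correction: down-weight pairs where the chosen response
    # is much longer than the rejected one, to curb verbosity bias.
    length_term = math.exp(-length_coef * max(0, len_chosen - len_rejected))
    return confidence * stability * length_term


def nrdpo_batch_loss(pairs, beta=0.1, gate=None):
    # Weighted mean of per-pair DPO losses; returns 0.0 if every pair is gated out.
    total, weight_sum = 0.0, 0.0
    for p in pairs:
        w = nrdpo_weight(p["ref_w"] - p["ref_l"], p["ref_margin_trunc"],
                         p["len_w"], p["len_l"], gate=gate)
        total += w * dpo_loss(p["pol_w"], p["pol_l"], p["ref_w"], p["ref_l"], beta)
        weight_sum += w
    return total / weight_sum if weight_sum > 0 else 0.0
```

With `gate=None` the sketch corresponds to the reweighting-only variant; passing a positive `gate` adds the filtering step described for nrDPO-gated. Normalizing by the sum of weights keeps the loss scale comparable when many pairs are down-weighted, which is a design choice of this sketch rather than something the abstract states.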
