June 9, 2026 | Open Access

DPO vs. RLHF: An Empirical Comparison of Alignment Techniques for Large Language Models

Key points

This work aims to compare Direct Preference Optimization and Reinforcement Learning from Human Feedback for aligning large language models.
Ran reproducible experiments on GPT-2 (124M parameters).
Used the Anthropic HH-RLHF dataset.
Evaluated alignment quality, reward accuracy, training efficiency, inference latency, and alignment tax under consumer hardware constraints.
DPO achieved 71% reward accuracy and a reward margin of 0.640 without requiring a reward model.
All experiments are reproducible on an NVIDIA RTX 3050 6GB GPU using open-source tooling.
The findings indicate that DPO and RLHF differ in training efficiency and alignment quality.

Abstract

This work presents a reproducible empirical comparison of Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) for Large Language Model alignment. Experiments are conducted on GPT-2 (124M parameters) using the Anthropic HH-RLHF dataset. The study evaluates alignment quality, reward accuracy, training efficiency, inference latency, and alignment tax under consumer hardware constraints. DPO achieves 71% reward accuracy and a reward margin of 0.640 without requiring a reward model. All experiments are reproducible on an NVIDIA RTX 3050 6GB GPU using open-source tooling. Source code and experimental artifacts are available at: https://github.com/AnthropicBots/dpo-vs-rlhf-alignmet-study
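For readers unfamiliar with how reward accuracy and reward margin can be reported without a reward model, the sketch below illustrates the standard DPO formulation, in which the implicit reward of a response is the scaled log-probability ratio between the policy and a frozen reference model. It is a minimal illustration, not the study's implementation: the names `PairLogProbs`, `implicit_rewards`, and `dpo_metrics`, the value `beta=0.1`, and the toy log-probabilities are all assumptions made here, and the linked repository may compute these quantities differently.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PairLogProbs:
    """Summed token log-probabilities for one preference pair (hypothetical container)."""
    policy_chosen: float
    policy_rejected: float
    ref_chosen: float
    ref_rejected: float


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def implicit_rewards(pair: PairLogProbs, beta: float) -> Tuple[float, float]:
    """DPO implicit reward: beta * (log pi_policy - log pi_ref) for each response."""
    r_chosen = beta * (pair.policy_chosen - pair.ref_chosen)
    r_rejected = beta * (pair.policy_rejected - pair.ref_rejected)
    return r_chosen, r_rejected


def dpo_metrics(pairs: List[PairLogProbs], beta: float = 0.1) -> Dict[str, float]:
    """Mean DPO loss, reward accuracy, and reward margin over a batch of pairs."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    losses, margins, wins = [], [], 0
    for pair in pairs:
        r_chosen, r_rejected = implicit_rewards(pair, beta)
        margin = r_chosen - r_rejected
        # DPO loss per pair: -log(sigmoid(margin)) == softplus(-margin)
        losses.append(softplus(-margin))
        margins.append(margin)
        # A pair counts as correct when the chosen response gets the higher reward.
        wins += 1 if margin > 0 else 0
    n = len(pairs)
    return {
        "loss": sum(losses) / n,
        "reward_accuracy": wins / n,
        "reward_margin": sum(margins) / n,
    }


if __name__ == "__main__":
    # Toy values chosen for illustration only; they are not taken from the study.
    toy_pairs = [
        PairLogProbs(-10.0, -14.0, -12.0, -12.0),
        PairLogProbs(-20.0, -18.0, -19.0, -19.0),
    ]
    print(dpo_metrics(toy_pairs, beta=0.1))
```

Under these definitions, a reward accuracy of 71% would mean that the policy assigns the higher implicit reward to the human-preferred response in 71% of evaluation pairs, and the reward margin is the average gap between the two implicit rewards.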
