This work presents a reproducible empirical comparison of Direct Preference Optimization (DPO) and Reinforcement Learning from Human Feedback (RLHF) for large language model alignment. Experiments are conducted on GPT-2 (124M parameters) using the Anthropic HH-RLHF dataset. The study evaluates alignment quality, reward accuracy, training efficiency, inference latency, and alignment tax under consumer hardware constraints. DPO achieves 71% reward accuracy and a reward margin of 0.640 without requiring a reward model. All experiments are reproducible on an NVIDIA RTX 3050 6GB GPU using open-source tooling. Source code and experimental artifacts are available at: https://github.com/AnthropicBots/dpo-vs-rlhf-alignmet-study
Mohit Yadav (Sun,) studied this question.
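The reward accuracy and reward margin reported for DPO can be read through the implicit reward that DPO defines, namely beta times the log-probability ratio between the trained policy and the frozen reference model. The sketch below is one common way to compute the two metrics from per-example sequence log-probabilities. It is an illustration under assumptions, not the study's own evaluation code: the function name `dpo_reward_metrics`, the default `beta=0.1`, and the toy inputs are all hypothetical.

```python
from typing import Sequence, Tuple


def dpo_reward_metrics(
    policy_chosen_logps: Sequence[float],
    policy_rejected_logps: Sequence[float],
    ref_chosen_logps: Sequence[float],
    ref_rejected_logps: Sequence[float],
    beta: float = 0.1,  # assumed value; the source does not state beta
) -> Tuple[float, float]:
    """Return (reward accuracy, mean reward margin) from DPO implicit rewards.

    Each input holds one summed sequence log-probability per preference pair.
    """
    n = len(policy_chosen_logps)
    if n == 0:
        raise ValueError("at least one preference pair is required")
    if not (len(policy_rejected_logps) == len(ref_chosen_logps)
            == len(ref_rejected_logps) == n):
        raise ValueError("all inputs must have the same length")

    # Implicit reward: beta * (log pi_policy(y|x) - log pi_ref(y|x)).
    chosen_rewards = [
        beta * (p - r) for p, r in zip(policy_chosen_logps, ref_chosen_logps)
    ]
    rejected_rewards = [
        beta * (p - r) for p, r in zip(policy_rejected_logps, ref_rejected_logps)
    ]

    # Margin per pair: how much more reward the chosen response receives.
    margins = [c - r for c, r in zip(chosen_rewards, rejected_rewards)]

    # Accuracy: fraction of pairs where the chosen response is ranked higher.
    accuracy = sum(1 for m in margins if m > 0) / n
    mean_margin = sum(margins) / n
    return accuracy, mean_margin


if __name__ == "__main__":
    # Toy log-probabilities for two preference pairs (made-up numbers).
    acc, margin = dpo_reward_metrics(
        policy_chosen_logps=[-10.0, -12.0],
        policy_rejected_logps=[-15.0, -9.0],
        ref_chosen_logps=[-11.0, -11.0],
        ref_rejected_logps=[-14.0, -10.0],
        beta=0.5,
    )
    print(f"reward accuracy = {acc:.2f}, mean reward margin = {margin:.3f}")
```

Because both metrics depend only on policy and reference log-probabilities, no separately trained reward model is needed to compute them, which matches the claim that DPO reaches its reported numbers without one.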