February 16, 2024 (Open Access)

Provably Sample Efficient RLHF via Active Preference Optimization


Abstract

Reinforcement Learning from Human Feedback (RLHF) is pivotal in aligning Large Language Models (LLMs) with human preferences. While these aligned generative models have demonstrated impressive capabilities across various tasks, the dependence on high-quality human preference data poses a costly bottleneck in the practical implementation of RLHF. Hence, better and more adaptive strategies for data collection are needed. To this end, we frame RLHF as a contextual preference bandit problem with prompts as contexts and show that the naive way of collecting preference data by choosing prompts uniformly at random leads to a policy that suffers an $\Omega(1)$ suboptimality gap in rewards. We then propose Active Preference Optimization (APO), an algorithm that actively selects prompts to collect preference data. Under the Bradley-Terry-Luce (BTL) preference model, APO achieves sample efficiency without compromising on policy performance. We show that given a sample budget of $T$, the suboptimality gap of a policy learned via APO scales as $O(1/\sqrt{T})$. Next, we propose a compute-efficient batch version of APO with minor modifications and evaluate its performance in practice. Experimental evaluations on a human preference dataset validate APO's efficacy as a sample-efficient and practical solution to data collection for RLHF, facilitating alignment of LLMs with human preferences in a cost-effective and scalable manner.
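
For readers unfamiliar with the BTL preference model mentioned in the abstract, the following is a minimal sketch of its standard form: the probability that one response is preferred over another is the logistic function of the difference in their rewards. The function name `btl_preference_prob` and the scalar reward inputs are illustrative assumptions, not taken from the paper, and the sketch does not implement APO itself.

```python
import math


def btl_preference_prob(reward_a: float, reward_b: float) -> float:
    """Probability that response A is preferred over response B under BTL.

    Hypothetical helper for illustration: the rewards are assumed to be
    scalar scores already assigned to the two responses for one prompt.
    """
    diff = reward_a - reward_b
    # Numerically stable logistic function of the reward difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


if __name__ == "__main__":
    # Equal rewards give no preference either way.
    print(btl_preference_prob(1.0, 1.0))
    # A higher reward for A makes A more likely to be preferred.
    print(btl_preference_prob(2.0, 0.5))
```

Only the reward difference matters in this model, so adding the same constant to both rewards leaves the preference probability unchanged.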
