The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. Reinforcement learning from human preferences (or preference-based RL) has emerged as a promising alternative, in which reward functions are learned from human preference labels over behavior trajectories. However, existing preference-based RL methods depend on accurate oracle preference labels. This paper addresses that limitation by developing a method for learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to a non-parameterized distribution. In addition, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMControl and Meta-World and shows consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.
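As a rough illustration of the ingredients named in the abstract, the sketch below shows one common way each piece can be written down. It is not the authors' implementation. It assumes a Bradley-Terry preference model over summed segment rewards, a standard normal as the fixed (non-parameterized) target distribution for a diagonal Gaussian latent, and inverse-variance weighting as the confidence measure for ensembling. All function names are hypothetical.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_prob(rewards_a, rewards_b):
    """Bradley-Terry probability that segment A is preferred to segment B.

    Each argument is a list of per-step predicted rewards for one segment.
    (Assumed preference model; the abstract does not specify it.)
    """
    return _sigmoid(sum(rewards_a) - sum(rewards_b))


def preference_loss(rewards_a, rewards_b, label):
    """Cross-entropy between a human label and the predicted preference.

    label is 1.0 if A was preferred, 0.0 if B was preferred.
    """
    p = preference_prob(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps) + (1.0 - label) * math.log(1.0 - p + eps))


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions.

    Assumed form of the latent-space constraint: a diagonal Gaussian latent
    pulled toward a fixed standard normal prior.
    """
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s) for m, s in zip(mu, sigma))


def confidence_weighted_reward(means, stds):
    """Combine ensemble members, trusting low-variance members more.

    Assumed confidence measure: weight each member by 1 / std^2.
    """
    weights = [1.0 / (s * s) for s in stds]
    return sum(w * m for w, m in zip(weights, means)) / sum(weights)
```

In a training loop of this kind, the reward model would typically minimize the preference loss plus a weighted KL term, and the policy would be trained on the combined ensemble reward. The weighting coefficient and the ensemble size are design choices that the abstract does not state.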
Wanqi Xue
Bo An
Shuicheng Yan
Nanyang Technological University
Tencent (China)
Xue et al. (Fri,) studied this question.
www.synapsesocial.com/papers/68e5ee87b6db6435875831b6 — DOI: https://doi.org/10.24963/ijcai.2024/586