What question did this study set out to answer?

To develop a dual-agent framework that enhances recommender systems through effective policy learning and dynamic reward shaping.

December 21, 2025 · Open Access

DARLR: Dual-Agent Offline Reinforcement Learning for Recommender Systems with Dynamic Reward

Key Points

Introduced a dual-agent framework (DARLR) for recommender systems.
Implemented a selector to identify reference users balancing similarity and diversity.
Refined reward estimations for dynamic reward shaping by aggregating information from selected users.
Adapted uncertainty penalties based on statistical features of selected users.
DARLR demonstrated superior performance in extensive experiments on four benchmark datasets.
Enhanced recommendation policies by dynamically updating world models.
Improved accuracy of reward estimations through effective user selection and aggregation.
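
One way to picture the selector's similarity/diversity trade-off is a greedy, maximal-marginal-relevance style pick over user embeddings. This is a minimal sketch under stated assumptions, not DARLR's actual selector: the source does not say how the selector scores users, so the cosine similarity, the `trade_off` weight, and the function names `cosine` and `select_reference_users` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_reference_users(target, candidates, k, trade_off=0.7):
    """Greedily pick up to k candidate indices (hypothetical scoring rule).

    Each step rewards similarity to the target user and penalizes
    redundancy with users that were already selected, so the final
    set is both relevant and diverse.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:

        def score(i):
            similarity = cosine(target, candidates[i])
            # Redundancy is the closest match among already selected users.
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return trade_off * similarity - (1 - trade_off) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `trade_off=1.0` the sketch reduces to plain nearest-neighbor selection; lowering it pushes the selection toward users who differ from those already chosen.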

Abstract

Model-based offline reinforcement learning (RL) has emerged as a promising approach for recommender systems, enabling effective policy learning by interacting with frozen world models. However, the reward functions in these world models, trained on sparse offline logs, often suffer from inaccuracies. Specifically, existing methods face two major limitations in addressing this challenge: (1) deterministic use of reward functions as static look-up tables, which propagates inaccuracies during policy learning, and (2) static uncertainty designs that fail to effectively capture decision risks and mitigate the impact of these inaccuracies. In this work, a dual-agent framework, DARLR, is proposed to dynamically update world models to enhance recommendation policies. To achieve this, a selector is introduced to identify reference users by balancing similarity and diversity so that the recommender can aggregate information from these users and iteratively refine reward estimations for dynamic reward shaping. Further, the statistical features of the selected users guide the dynamic adaptation of an uncertainty penalty to better align with evolving recommendation requirements. Extensive experiments on four benchmark datasets demonstrate the superior performance of DARLR, validating its effectiveness. The code is available at https://github.com/ArronDZhang/DARLR.
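
The reward-shaping idea in the abstract can be illustrated with a small sketch: blend the world model's reward estimate with rewards observed for the selected reference users, then subtract a penalty that grows with how much those users disagree. The abstract does not give the actual aggregation rule or penalty form, so the mean/standard-deviation choice, the `blend` and `penalty_weight` parameters, and the name `shaped_reward` are assumptions made purely for illustration.

```python
from statistics import fmean, pstdev


def shaped_reward(model_reward, reference_rewards, blend=0.5, penalty_weight=1.0):
    """Refine a world-model reward using reference users (hypothetical rule).

    model_reward      -- reward predicted by the world model
    reference_rewards -- rewards observed for the selected reference users
    blend             -- weight given to the aggregated reference rewards
    penalty_weight    -- scale of the uncertainty penalty
    """
    if not reference_rewards:
        # No reference information: fall back to the world model's estimate.
        return model_reward

    # Aggregate reference-user information (assumed here: a simple mean).
    aggregated = fmean(reference_rewards)
    refined = (1 - blend) * model_reward + blend * aggregated

    # A statistical feature of the selected users drives the penalty
    # (assumed here: the population standard deviation of their rewards).
    uncertainty = pstdev(reference_rewards)
    return refined - penalty_weight * uncertainty
```

Because the penalty is recomputed from whichever users are currently selected, it changes as the selection changes, in contrast to a fixed, static uncertainty term.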
