What question did this study set out to answer?

The study aims to improve user satisfaction in sequential recommender systems by addressing distribution shift in offline reinforcement learning.

April 15, 2026 | Open Access

Mitigating Distribution Shift in Offline RL-Based Recommender Systems with a Q-Learning Regularization Decision Transformer

Key Points

The aim is to improve user satisfaction in sequential recommender systems by addressing distribution shift in offline reinforcement learning (RL).
Developed the Q-Learning Regularized Decision Transformer (QRDT) framework.
Integrated Kullback–Leibler (KL) divergence and maximum entropy regularization into the Q-value function.
Evaluated performance using four Amazon e-commerce datasets.
QRDT outperformed the PGPR baseline in most scenarios.
Achieved a 2.99% improvement in Hit Rate (HR).
Achieved a 2.19% improvement in Normalized Discounted Cumulative Gain (NDCG).
Achieved a 0.94% improvement in Recall.
Achieved a 0.84% improvement in Precision.
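
For readers unfamiliar with the four metrics listed above, the following is a minimal sketch of how they are commonly computed for a single user from a ranked recommendation list. The function name `ranking_metrics`, the cutoff `k`, and the binary-relevance form of NDCG are assumptions made for illustration; the summary does not state the cutoff or the exact evaluation protocol used in the study.

```python
import math


def ranking_metrics(recommended, relevant, k=10):
    """Compute HR@k, NDCG@k, Recall@k and Precision@k for one user.

    recommended: ranked list of item ids, best first (hypothetical input).
    relevant: set of held-out item ids the user actually interacted with.
    k: cutoff; the value used in the study is not given in the summary.
    """
    top_k = recommended[:k]
    hits = [1 if item in relevant else 0 for item in top_k]
    num_hits = sum(hits)

    # Hit Rate: 1 if at least one relevant item appears in the top k.
    hit_rate = 1.0 if num_hits > 0 else 0.0
    # Precision: fraction of the k recommended slots that are relevant.
    precision = num_hits / k
    # Recall: fraction of the relevant items that were recommended.
    recall = num_hits / len(relevant) if relevant else 0.0

    # DCG with binary relevance: a hit at 0-based rank i adds 1 / log2(i + 2).
    dcg = sum(h / math.log2(i + 2) for i, h in enumerate(hits))
    # Ideal DCG: all relevant items (up to k) placed at the top of the list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 2) for i in range(ideal_hits))
    ndcg = dcg / idcg if idcg > 0 else 0.0

    return {"HR": hit_rate, "NDCG": ndcg, "Recall": recall, "Precision": precision}


# Example: two of the user's three held-out items appear in the top 4.
print(ranking_metrics(["a", "b", "c", "d"], {"a", "d", "z"}, k=4))
```

In practice these per-user values are averaged over all test users before methods are compared.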

Abstract

Optimizing long-term user satisfaction in sequential recommender systems is a critical challenge. Offline reinforcement learning (RL) offers a promising solution by learning recommendation policies from historical interaction logs without incurring the high costs of online exploration. However, offline RL suffers from severe distribution shift: the learned policy often overestimates the value of out-of-distribution (OOD) items, leading to unreliable recommendations and compromising user satisfaction. To address this issue, we propose a novel framework known as the Q-Learning Regularized Decision Transformer (QRDT). Built upon the Decision Transformer architecture, QRDT models recommendations as a sequence prediction task to capture complex user interest dynamics. To mitigate distribution shift, the QRDT integrates Kullback–Leibler (KL) divergence and maximum entropy regularization into the Q-value function, enabling conservative long-term value estimation while encouraging diverse exploration within the logged data distribution. Extensive experiments on four real-world Amazon e-commerce datasets (CDs, Clothing, Cellphones, and Beauty) demonstrate that the QRDT achieves competitive performance and outperforms the PGPR baseline in most scenarios. Specifically, the proposed method yields improvements of 2.99% in Hit Rate (HR), 2.19% in Normalized Discounted Cumulative Gain (NDCG), 0.94% in Recall, and 0.84% in Precision, verifying the effectiveness of our regularization approach.
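
The abstract does not give the exact form of the regularized objective, so the following is only an illustrative sketch of how a KL penalty and an entropy bonus can be combined with Q-values for a single state with a discrete set of candidate items. The function `regularized_value`, the weights `alpha` and `lam`, and the behavior policy `behavior` (the item distribution implied by the logged data) are all hypothetical names, not QRDT's actual formulation.

```python
import math


def regularized_value(q_values, policy, behavior, alpha=1.0, lam=0.1):
    """Return E_pi[Q] - alpha * KL(pi || beta) + lam * H(pi) for one state.

    q_values: estimated Q-value per candidate item.
    policy: learned policy's probability per item (pi).
    behavior: logged-data probability per item (beta); assumed to be
        positive wherever the policy is positive.
    alpha, lam: hypothetical weights for the KL penalty and entropy bonus.
    """
    # Expected Q-value under the learned policy.
    expected_q = sum(p * q for p, q in zip(policy, q_values))
    # KL(pi || beta): grows as the policy puts mass on rarely logged items.
    kl = sum(p * math.log(p / b) for p, b in zip(policy, behavior) if p > 0)
    # Entropy of the policy: rewards spreading probability over items.
    entropy = -sum(p * math.log(p) for p in policy if p > 0)
    return expected_q - alpha * kl + lam * entropy


# Item 1 has a high (possibly overestimated) Q-value but is rare in the logs.
q = [1.0, 5.0]
beta = [0.9, 0.1]
in_distribution = regularized_value(q, beta, beta, alpha=5.0)
out_of_distribution = regularized_value(q, [0.1, 0.9], beta, alpha=5.0)
print(in_distribution > out_of_distribution)
```

Under this sketch, a policy that chases the rarely logged item is penalized by the KL term despite its higher raw Q-value, which is the conservative behavior the abstract describes, while the entropy term keeps the policy from collapsing onto a single item.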
