Optimizing long-term user satisfaction in sequential recommender systems is a critical challenge. Offline reinforcement learning (RL) offers a promising solution by learning recommendation policies from historical interaction logs without incurring the high costs of online exploration. However, offline RL suffers from severe distribution shift: the learned policy often overestimates the value of out-of-distribution (OOD) items, which leads to unreliable recommendations and compromises user satisfaction. To address this issue, we propose the Q-Learning Regularized Decision Transformer (QRDT). Built upon the Decision Transformer architecture, QRDT models recommendation as a sequence prediction task to capture complex user interest dynamics. To mitigate distribution shift, QRDT integrates Kullback–Leibler (KL) divergence and maximum entropy regularization into the Q-value function, enabling conservative long-term value estimation while encouraging diverse exploration within the logged data distribution. Extensive experiments on four real-world Amazon e-commerce datasets (CDs, Clothing, Cellphones, and Beauty) demonstrate that QRDT achieves competitive performance and outperforms the PGPR baseline in most scenarios. Specifically, the proposed method yields improvements of 2.99% in Hit Rate (HR), 2.19% in Normalized Discounted Cumulative Gain (NDCG), 0.94% in Recall, and 0.84% in Precision, supporting the effectiveness of the regularization approach.
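To make the regularization idea in the abstract concrete, the following is a minimal sketch of how a KL penalty and an entropy bonus can temper a value estimate over candidate items. It is not the formulation used by QRDT, which the abstract does not specify. The softmax policy, the function names, the weights `alpha` and `beta`, the `temperature`, and the example numbers are all assumptions made for illustration.

```python
import math


def softmax(scores, temperature=1.0):
    """Turn per-item scores into a probability distribution."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same items."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def entropy(p, eps=1e-12):
    """Shannon entropy of a discrete distribution (in nats)."""
    return -sum(pi * math.log(pi + eps) for pi in p if pi > 0)


def regularized_value(q_values, behavior_probs,
                      alpha=1.0, beta=0.1, temperature=1.0):
    """Expected Q under a softmax policy, minus a KL penalty toward the
    logged (behavior) distribution, plus an entropy bonus.

    alpha, beta, and temperature are hypothetical hyperparameters.
    Returns (regularized value, unregularized expected Q, KL, entropy).
    """
    policy = softmax(q_values, temperature)
    expected_q = sum(pi * qi for pi, qi in zip(policy, q_values))
    kl = kl_divergence(policy, behavior_probs)
    ent = entropy(policy)
    return expected_q - alpha * kl + beta * ent, expected_q, kl, ent


# Illustrative numbers only: the third item has a high Q estimate but is
# almost never seen in the logs, so the policy that favors it sits far
# from the behavior distribution.
q_values = [2.0, 1.0, 5.0]
behavior_probs = [0.6, 0.39, 0.01]
value, expected_q, kl, ent = regularized_value(q_values, behavior_probs)
print(f"expected Q = {expected_q:.3f}, KL = {kl:.3f}, "
      f"entropy = {ent:.3f}, regularized value = {value:.3f}")
```

In this toy setting the KL term grows when the policy concentrates on items that the logs rarely contain, which lowers the regularized value and gives the conservative behavior the abstract describes. The entropy term adds a small reward for spreading probability across items.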
Zhou et al. studied this question.