Abstract: In offline reinforcement learning, model-based approaches have demonstrated superior data efficiency by using learned dynamics models to generate additional training samples. However, because model inaccuracies are inevitable, deriving policies directly from such models often leads to suboptimal performance under the constraints of the offline setting. Prior work has attempted to mitigate this issue with conservative strategies that avoid reliance on out-of-distribution transitions. These methods still face notable challenges, as dynamics models trained solely on historical data typically struggle to generalize to unseen state-action pairs. In this paper, we propose a novel offline reinforcement learning method, Dynamic Reward-Guided Multi-Head Attention for Actor-Critic Policy Learning Optimization (DRMAAC). DRMAAC introduces a dynamic-aware paradigm that focuses on capturing the intrinsic characteristics of the behavior policy. It uses inverse reinforcement learning to recover a reward-consistent dynamics model and to identify high-return states. An Actor-Critic architecture enhanced with multi-head attention then makes decisions guided by these high-value states. This integration enables the model to better capture long-term dependencies and to prioritize informative features in complex state spaces. Empirical evaluations on the D4RL benchmark show that DRMAAC consistently outperforms previous state-of-the-art methods across a variety of tasks. These results highlight not only improved data efficiency but also strong generalization under diverse environmental conditions. Overall, DRMAAC presents a promising direction for advancing model-based offline reinforcement learning by combining attention mechanisms with reward-consistent dynamics modeling.
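The abstract does not give the architecture in detail, so the following is only a generic sketch of the multi-head attention operation it refers to, applied to a short sequence of state features. It is not the DRMAAC implementation: the function name `multi_head_attention`, the weight matrices, the dimensions, and the random inputs are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_attention(states, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention over a sequence of state features.

    states: array of shape (T, d_model), one row per time step (assumed layout).
    Returns the attended features (T, d_model) and the attention weights
    (num_heads, T, T).
    """
    seq_len, d_model = states.shape
    d_head = d_model // num_heads

    def split_heads(x):
        # (T, d_model) -> (num_heads, T, d_head)
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q = split_heads(states @ w_q)
    k = split_heads(states @ w_k)
    v = split_heads(states @ w_v)

    # Each head scores every time step against every other time step.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)

    # Weighted sum of values, then merge the heads back together.
    context = weights @ v
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, weights


# Hypothetical sizes and random data, purely for demonstration.
rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 8, 2
states = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v, w_o = (0.1 * rng.normal(size=(d_model, d_model)) for _ in range(4))

features, weights = multi_head_attention(states, w_q, w_k, w_v, w_o, num_heads)
```

In an actor-critic setting, features like these would typically be passed to separate policy and value heads. How DRMAAC wires them together, and how the high-return states guide the attention, is specified in the paper rather than here.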
Xiaohui Huang
Jia Zong
Xiaofei Yang
Human-Centric Intelligent Systems
Huang et al. studied this question.
synapsesocial.com/papers/69843422f1d9ada3c1fb1dfa — DOI: https://doi.org/10.1007/s44230-026-00135-8