Offline Reinforcement Learning (RL) has emerged as a promising paradigm to overcome the limitations of conventional RL methods that depend on extensive and often unsafe online interactions with the environment. By learning policies exclusively from pre-collected datasets, Offline RL aligns with the growing emphasis on data efficiency and safety in machine learning. However, existing off-policy RL algorithms encounter substantial difficulties when trained purely on offline data, primarily due to distributional shift between the data in the training dataset and the states and actions visited by the learned policy. This issue becomes more pronounced with high-dimensional function approximation, leading to degraded performance and poor generalization. This survey provides a comprehensive overview of recent advances in Offline RL, with a particular focus on the challenges of distributional shift, generalization, and out-of-distribution (OOD) actions. We review the evolution of the field, discuss data limitations, and analyze the historical and theoretical foundations of distributional shift in Offline RL. Furthermore, we categorize existing approaches into four major groups: Q-value restriction methods, uncertainty-based Q-restriction methods, policy constraint methods, and uncertainty-based policy constraint methods, highlighting their core principles and practical implications. We also summarize common benchmarks such as D4RL, discuss dataset enhancement strategies, and examine evaluation metrics and computational efficiency. Through this analysis, the survey offers a unified perspective on current solutions and identifies open challenges, providing guidance for advancing robust, generalizable, and trustworthy Offline RL in domains such as robotics, healthcare, and autonomous systems.
Samani et al. (Sun,) studied this question.