Offline-to-online reinforcement learning (O2O RL) enables agents to leverage offline-pretrained policies and adapt efficiently to target environments through limited online interaction. However, during the transition from offline training to online fine-tuning, agents often suffer a significant performance drop caused by distribution shift. Existing approaches typically address this issue either by constraining policy updates or by adjusting sample replay according to the properties of online interaction. Nevertheless, these methods do not address the core challenge in O2O RL, which is striking a proper balance between optimism and pessimism in Q-value estimation. In this article, we propose intrinsic value-aligned policy optimization (IVPO), a novel method that uses intrinsic value extraction to compress state knowledge from the offline phase into a learned intrinsic value function. IVPO combines this intrinsic value function with the Q-value function to guide Q-value updates during online learning. By capturing the potential value of state-action pairs and suppressing overestimation for out-of-distribution (OOD) actions, IVPO calibrates Q-value estimation and thereby enables more effective policy improvement. In addition, we provide a theoretical analysis of IVPO's regret bound and convergence in the online fine-tuning phase. Extensive experiments show that IVPO significantly reduces Q-value estimation errors and achieves state-of-the-art performance on the D4RL benchmark, improving overall task performance by 54.3% across 18 tasks initialized from offline policies.
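To make the idea of calibrating a Q-value target with a second value estimate concrete, the following is a minimal sketch and not IVPO's actual update rule, which the abstract does not specify. The function name `calibrated_td_target`, the mixing weight `beta`, and the blend-then-cap rule are all hypothetical choices made for illustration. The sketch shrinks the bootstrap term toward an intrinsic value estimate only when the Q estimate exceeds it, which is one simple way to damp overestimation for OOD actions while leaving other targets untouched.

```python
def calibrated_td_target(reward, gamma, q_next, v_intrinsic_next, beta=0.5):
    """Illustrative TD target (hypothetical, not the paper's rule).

    q_next:            bootstrap Q estimate for the next state-action pair
    v_intrinsic_next:  intrinsic value estimate for the next state
    beta:              assumed mixing weight in [0, 1]
    """
    # Convex blend of the Q estimate and the intrinsic value estimate.
    blended = beta * q_next + (1.0 - beta) * v_intrinsic_next
    # The blend is below q_next only when q_next exceeds the intrinsic value,
    # so taking the minimum shrinks suspected overestimates and nothing else.
    bootstrap = min(q_next, blended)
    return reward + gamma * bootstrap


# Suspected overestimate: the Q estimate is far above the intrinsic value.
ood_target = calibrated_td_target(1.0, 0.9, q_next=10.0, v_intrinsic_next=2.0)

# Q estimate already below the intrinsic value: the target is a plain TD target.
in_dist_target = calibrated_td_target(1.0, 0.9, q_next=1.0, v_intrinsic_next=2.0)
```

Under these assumptions, `beta` controls the balance between optimism (`beta = 1` recovers the standard TD target) and pessimism (`beta = 0` caps the bootstrap at the intrinsic value).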
Liu et al. studied this question.