May 22, 2026

Intrinsic Value-Aligned Policy Optimization for Offline-to-Online Reinforcement Learning

Key Points

Aimed to address performance drops in offline-to-online reinforcement learning caused by distribution shift
Proposed the intrinsic value-aligned policy optimization (IVPO) method
Integrated an intrinsic value function with the Q value function to guide Q value updates during online learning
Ran extensive experiments on the D4RL benchmark across 18 tasks initialized from offline policies
Significantly reduced Q value estimation errors
Achieved a 54.3% improvement in overall task performance
Demonstrated state-of-the-art performance compared to existing methods

Abstract

Offline-to-online reinforcement learning (O2O RL) enables agents to leverage offline-pretrained policies and efficiently adapt to target environments through limited online interactions. However, during the transition from offline training to online fine-tuning, agents often experience a significant performance drop due to distribution shift. Existing approaches typically address this issue by either constraining policy updates or adjusting sample replay based on the properties of online interaction. Nevertheless, these methods fail to fundamentally address the core challenge in O2O RL, which lies in achieving a proper balance between optimism and pessimism in Q value estimation. In this article, we propose intrinsic value-aligned policy optimization (IVPO), a novel method that introduces intrinsic value extraction to compress state knowledge from the offline phase, thereby learning an intrinsic value function. IVPO integrates this intrinsic value function with the Q value function to guide the Q value updates during online learning. By capturing the potential value of state-action pairs and suppressing overestimation for out-of-distribution (OOD) actions, IVPO calibrates the Q value estimation, ultimately leading to more effective policy improvement. In addition, we provide a theoretical analysis of IVPO's regret bound and convergence in the online fine-tuning phase. Extensive experiments show that IVPO significantly mitigates Q value estimation errors and achieves state-of-the-art performance on the D4RL benchmark, improving overall task performance by 54.3% across 18 tasks initialized from offline policies.
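
The abstract does not state the update rule IVPO uses, so the following is only an illustrative sketch of the general idea of calibrating a bootstrapped Q target with a separately learned state-value estimate. The function name `calibrated_td_target`, the mixing weight `alpha`, and the convex-combination form are all assumptions made for illustration, not the paper's method.

```python
import numpy as np


def calibrated_td_target(reward, next_q, next_intrinsic_v, gamma=0.99, alpha=0.5):
    """Hypothetical TD target that blends two value estimates.

    NOTE: this is an assumed, simplified form for illustration only;
    it is not taken from the IVPO paper.

    reward            -- batch of immediate rewards
    next_q            -- Q estimate at the next state-action pair
    next_intrinsic_v  -- value estimate from a function learned offline
    alpha             -- assumed mixing weight in [0, 1]
    """
    reward = np.asarray(reward, dtype=float)
    next_q = np.asarray(next_q, dtype=float)
    next_intrinsic_v = np.asarray(next_intrinsic_v, dtype=float)
    # Convex combination: alpha = 0 recovers the plain bootstrapped target,
    # larger alpha pulls the target toward the offline value estimate.
    blended = (1.0 - alpha) * next_q + alpha * next_intrinsic_v
    return reward + gamma * blended


# An inflated next-state Q value (10.0) is pulled toward the offline
# estimate (4.0) before it is bootstrapped into the target.
target = calibrated_td_target([1.0], [10.0], [4.0], gamma=0.9, alpha=0.5)
```

With `alpha=0` the function reduces to the standard one-step target `reward + gamma * next_q`, so the mixing weight controls how strongly the offline estimate tempers optimism in the online Q values in this toy setup.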
