Recent advances in deep offline reinforcement learning (RL) have made it possible to train strong robotic agents from offline datasets. However, depending on the quality of the trained agents and the application being considered, it is often desirable to fine-tune such agents via further online interactions. In this paper, we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good policy obtained via offline RL. To address this issue, we first propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset. Furthermore, we leverage multiple Q-functions trained pessimistically offline, preventing overoptimism concerning unfamiliar actions at novel states during the initial training phase. We show that the proposed method improves sample-efficiency and final performance of the fine-tuned robotic agents on locomotion and manipulation tasks. Our code is available at: https://github.com/shlee94/Off2OnRL.
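To make the two ideas in the abstract concrete, the following is a minimal illustrative sketch, not the implementation from the linked repository. It assumes each offline transition already carries a hypothetical "on-policyness" score in [0, 1] (how such a score is estimated is not specified here), gives every online transition a priority of 1.0, and combines an ensemble of Q-functions by simple averaging. The function names `sample_balanced` and `ensemble_q`, the score values, and the averaging rule are all assumptions made for illustration.

```python
import random


def sample_balanced(online, offline, offline_scores, batch_size, rng=None):
    """Draw a batch that favors online samples and near-on-policy offline samples.

    online:         transitions collected during fine-tuning (priority 1.0 each).
    offline:        transitions from the offline dataset.
    offline_scores: hypothetical on-policyness scores in [0, 1], one per
                    offline transition (an assumption of this sketch).
    """
    if len(offline) != len(offline_scores):
        raise ValueError("need exactly one score per offline transition")
    rng = rng or random.Random()
    pool = list(online) + list(offline)
    weights = [1.0] * len(online) + list(offline_scores)
    # Sample with replacement, with probability proportional to priority.
    return rng.choices(pool, weights=weights, k=batch_size)


def ensemble_q(q_functions, state, action):
    """Combine several Q-functions by averaging (an assumed combination rule)."""
    values = [q(state, action) for q in q_functions]
    return sum(values) / len(values)


if __name__ == "__main__":
    online = ["on_0", "on_1"]
    offline = ["off_near", "off_far"]
    scores = [0.5, 0.0]  # "off_far" is treated as far from the current policy
    batch = sample_balanced(online, offline, scores, 8, random.Random(0))
    print(batch)
    print(ensemble_q([lambda s, a: 1.0, lambda s, a: 3.0], None, None))
```

In this sketch an offline transition with score 0.0 is never drawn, while online transitions are always eligible. A real system would update the scores as the policy changes during fine-tuning.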
Lee et al. studied this question.