Abstract Conventional online reinforcement learning (RL) systems interact with their environments to gather data, aiming to learn an optimal policy that maximizes a predefined cumulative reward. However, online RL is of limited use where cost and safety are critical concerns. Offline RL addresses this by using previously collected datasets to derive an effective policy without further interaction with the environment. A significant hurdle in offline RL is the tendency to overestimate the values of actions not represented in the data (out-of-distribution, or OOD, actions), often due to insufficient exploration of the state-action space. Prior works tend to increase algorithmic complexity to improve performance; in contrast, this paper halves the algorithm's computation while also improving performance. In this paper, we examine the inaccuracies associated with training the critic in isolation, without the policy improvement typically driven by actor training. We hypothesize that the dataset's inherent Q-values (behavior Q-values) may be more accurate than those derived from actor-critic cycles, which are susceptible to overestimation and volatility. After validating the reliability of the dataset Q-values, we propose pre-training models to capture these values before offline training begins, which could improve algorithmic efficiency across various models and reduce computational demands. Moreover, using precise estimates of the dataset Q-values, we advocate a conservative approach to offline training that demonstrably mitigates value overestimation. Our methods have the potential to generalize across offline RL algorithms and may offer a new perspective on optimizing offline learning.
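The abstract does not say how the behavior Q-values are estimated. One common choice, assumed here purely for illustration and not confirmed as the authors' method, is the Monte Carlo return-to-go of each logged transition: for a complete episode, the discounted sum of rewards from step t onward is a sample of the behavior policy's Q-value at that step. The function name `discounted_returns` and the discount `gamma` are hypothetical, and the sketch assumes each episode in the dataset runs to termination.

```python
from typing import List


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1} for one logged episode.

    Assumption: the episode is complete, so G after the last step is 0.
    Each G_t is a Monte Carlo sample of the behavior Q-value Q(s_t, a_t).
    """
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each step reuses the return of the step after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Toy episode with three logged rewards (illustrative values only).
episode_rewards = [1.0, 0.0, 2.0]
targets = discounted_returns(episode_rewards, gamma=0.5)
print(targets)  # [1.5, 1.0, 2.0]
```

Targets of this kind could serve as regression labels when pre-training a critic on the dataset, since they depend only on logged rewards and never query the value of an out-of-distribution action.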
Zhang et al. (Wed,) studied this question.