A coverage assumption is critical for policy gradient methods: the objective function is insensitive to updates in unlikely states, yet the agent may need improvements in exactly those states to reach a nearly optimal payoff. However, this assumption can be infeasible in certain environments, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms like REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative pure-exploration strategy over the state space that improves the coverage of any restart distribution ρ. Using ρ and intrinsic rewards, Curious Explorer produces a sequence of policies, each more exploratory than the previous one, and outputs a restart distribution whose coverage is based on the state visitation distributions of the exploratory policies. The main results of this paper are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with REINFORCE and TRPO in two hard-exploration tasks, to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.
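The iterative scheme described above can be illustrated with a small tabular sketch. It is not the authors' algorithm: the abstract does not specify the intrinsic reward, the policy optimizer, or how visitation distributions are combined, so everything below is an assumption made for illustration. In particular, the chain MDP, the 0/1 novelty bonus on states whose visitation falls below a hypothetical `threshold`, the use of value iteration with a known transition model in place of a learned policy, the uniform averaging of visitation distributions, and all function names (`chain_mdp`, `visitation`, `greedy_policy`, `curious_explorer_sketch`) are hypothetical.

```python
import numpy as np

def chain_mdp(n_states):
    # Toy deterministic chain (assumption): action 0 moves left, action 1 moves right.
    # P[a, s, s2] is the probability of reaching s2 from s under action a.
    P = np.zeros((2, n_states, n_states))
    for s in range(n_states):
        P[0, s, max(s - 1, 0)] = 1.0
        P[1, s, min(s + 1, n_states - 1)] = 1.0
    return P

def visitation(policy, rho, P, gamma, horizon=200):
    # Truncated, normalized discounted state visitation of `policy` started from rho.
    P_pi = np.einsum("sa,ast->st", policy, P)  # state-to-state matrix under the policy
    d = np.zeros_like(rho)
    p = rho.copy()
    for t in range(horizon):
        d += (gamma ** t) * p
        p = p @ P_pi
    return d / d.sum()

def greedy_policy(reward, P, gamma, iters=500):
    # Value iteration with the reward collected on the arrival state.
    # Stand-in for a learned exploratory policy (assumption: the model P is known).
    n_actions, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(iters):
        Q = P @ (reward + gamma * V)  # shape (n_actions, n_states)
        V = Q.max(axis=0)
    policy = np.zeros((n_states, n_actions))
    policy[np.arange(n_states), Q.argmax(axis=0)] = 1.0
    return policy

def curious_explorer_sketch(P, rho, gamma=0.9, rounds=10, threshold=0.05):
    # Each round rewards states that the current mixture visits rarely,
    # plans for that intrinsic reward, and folds the new visitation into the mixture.
    mix = rho.copy()
    visitations = []
    for _ in range(rounds):
        reward = (mix < threshold).astype(float)  # hypothetical 0/1 novelty bonus
        if not reward.any():
            break  # every state already has at least `threshold` mass
        policy = greedy_policy(reward, P, gamma)
        visitations.append(visitation(policy, rho, P, gamma))
        mix = np.mean(visitations, axis=0)  # uniform mixture (assumption)
    return mix

if __name__ == "__main__":
    P = chain_mdp(6)
    rho = np.zeros(6)
    rho[0] = 1.0  # restarts only from a fixed initial state
    print(curious_explorer_sketch(P, rho))
```

In this toy setting the initial restart distribution puts all of its mass on a single state, while the returned mixture assigns mass to every state of the chain, which is the kind of coverage a downstream policy gradient method could then use for restarts.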
Marco Miani
Maurizio Parton
Marco Romito
IEEE Transactions on Pattern Analysis and Machine Intelligence
Technical University of Denmark
University of Pisa
University of Chieti-Pescara
Miani et al. studied this question.
www.synapsesocial.com/papers/68e58481b6db643587521aeb — DOI: https://doi.org/10.1109/tpami.2024.3460972