September 16, 2024 | Open Access

Curious Explorer: a Provable Exploration Strategy in Policy Learning

Abstract

A coverage assumption is critical for policy gradient methods: although the objective function is insensitive to updates in unlikely states, the agent may still need improvements in those states to reach a nearly optimal payoff. However, this assumption can be infeasible in certain environments, for instance in online learning or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms such as REINFORCE can have poor convergence properties and sample efficiency. Curious Explorer is an iterative, pure exploration strategy over the state space that improves the coverage of any restart distribution ρ. Using ρ and intrinsic rewards, Curious Explorer produces a sequence of policies, each more exploratory than the previous one, and outputs a restart distribution whose coverage is based on the state visitation distributions of the exploratory policies. The main results of this paper are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with REINFORCE and TRPO in two hard-exploration tasks to support the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.
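The loop described in the abstract (start from a restart distribution ρ, reward poorly visited states, collect the visitation distributions of increasingly exploratory policies) can be illustrated with a toy tabular sketch. This is not the authors' algorithm. The deterministic chain environment, the inverse-visitation bonus, the value-iteration planner, the uniform averaging of visitation distributions, and every function name below are assumptions chosen only to make the idea concrete.

```python
def step(s, a, n):
    """Hypothetical deterministic chain: action 0 moves left, action 1 moves right."""
    return max(s - 1, 0) if a == 0 else min(s + 1, n - 1)


def visitation(policy, start, n, gamma, horizon=500):
    """Truncated discounted visitation: d(s) = (1 - gamma) * sum_t gamma^t * Pr(s_t = s)."""
    d = [0.0] * n
    p = list(start)          # state distribution at the current time step
    w = 1.0 - gamma          # weight (1 - gamma) * gamma^t
    for _ in range(horizon):
        for s in range(n):
            d[s] += w * p[s]
        nxt = [0.0] * n
        for s in range(n):
            for a in (0, 1):
                nxt[step(s, a, n)] += p[s] * policy[s][a]
        p = nxt
        w *= gamma
    return d


def greedy_policy(reward, n, gamma, sweeps=500):
    """Value iteration on a reward paid on arrival; returns a deterministic policy."""
    v = [0.0] * n
    for _ in range(sweeps):
        v = [max(reward[step(s, a, n)] + gamma * v[step(s, a, n)] for a in (0, 1))
             for s in range(n)]
    policy = []
    for s in range(n):
        q = [reward[step(s, a, n)] + gamma * v[step(s, a, n)] for a in (0, 1)]
        best = 0 if q[0] > q[1] else 1   # ties are broken toward "right"
        policy.append([1.0 if a == best else 0.0 for a in (0, 1)])
    return policy


def explore(n=8, gamma=0.9, rounds=5, eps=1e-6):
    """Return (visitation of the initial policy, averaged visitation after `rounds`)."""
    rho = [1.0] + [0.0] * (n - 1)             # restarts only from state 0
    uniform = [[0.5, 0.5] for _ in range(n)]  # initial, non-exploratory policy
    visits = [visitation(uniform, rho, n, gamma)]
    for _ in range(rounds):
        # Current restart distribution: average of all visitation distributions so far.
        mu = [sum(d[s] for d in visits) / len(visits) for s in range(n)]
        # Assumed intrinsic reward: large where the current mixture rarely goes.
        bonus = [1.0 / (mu[s] + eps) for s in range(n)]
        policy = greedy_policy(bonus, n, gamma)
        visits.append(visitation(policy, mu, n, gamma))
    mu = [sum(d[s] for d in visits) / len(visits) for s in range(n)]
    return visits[0], mu


if __name__ == "__main__":
    first, mu = explore()
    print("initial visitation of last state:", first[-1])
    print("mixture visitation of last state:", mu[-1])
```

In this toy setting the averaged distribution places more mass on the far end of the chain than the initial policy does, which is the sense of "improved coverage" the sketch is meant to convey. The bounds stated in the abstract concern the paper's actual procedure and are not reproduced by this example.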

Cite This Study

Miani et al. (2024) studied this question.

synapsesocial.com/papers/68e58481b6db643587521aeb
https://doi.org/10.1109/tpami.2024.3460972
