February 13, 2022Open Access

Sample-Efficient Reinforcement Learning with loglog(T) Switching Cost

Puntos clave

Los puntos clave no están disponibles para este artículo en este momento.

Resumen

We study the problem of reinforcement learning (RL) with low (policy) switching cost - a problem well-motivated by real-life RL applications in which deployments of new policies are costly and the number of policy updates must be low. In this paper, we propose a new algorithm based on stage-wise exploration and adaptive policy elimination that achieves a regret of O (H⁴S²AT) while requiring a switching cost of O (HSA T). This is an exponential improvement over the best-known switching cost O (H²SA T) among existing methods with O (poly (H, S, A) T) regret. In the above, S, A denotes the number of states and actions in an H-horizon episodic Markov Decision Process model with unknown transitions, and T is the number of steps. As a byproduct of our new techniques, we also derive a reward-free exploration algorithm with a switching cost of O (HSA). Furthermore, we prove a pair of information-theoretical lower bounds which say that (1) Any no-regret algorithm must have a switching cost of Ω (HSA) ; (2) Any O (T) regret algorithm must incur a switching cost of Ω (HSA T). Both our algorithms are thus optimal in their switching costs.

Me gusta

Guardar

Ver artículo completo

Cite This Study

Qiao et al. (Sun,) studied this question.

synapsesocial.com/papers/6a0ed751950456576347cd99 https://doi.org/https://doi.org/10.48550/arxiv.2202.06385

Me gusta

Guardar

Ver artículo completo