Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit

Abstract

Although the Multi-Armed Bandit (MAB) and the policy gradient approach are among the most widely used frameworks in Reinforcement Learning, the theoretical properties of the policy gradient algorithm applied to MAB have received little attention. In this work we investigate the convergence of this procedure when an L2 regularization term is used together with the 'softmax' parametrization. We prove convergence under appropriate technical hypotheses and test the procedure numerically, including in situations beyond the theoretical setting. The tests show that a time-dependent regularized procedure can improve on the canonical approach, especially when the initial guess is far from the solution.
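
As an illustration of the kind of procedure the abstract describes, the following is a minimal sketch of a softmax policy gradient update with an L2 penalty on the parameters. It is not the authors' implementation: the function name `run_bandit`, the REINFORCE-style gradient estimate, the Gaussian reward noise, the constant regularization coefficient `gamma`, and all default values are assumptions made for the example. The time-dependent regularization mentioned in the abstract is not reproduced here.

```python
import numpy as np


def softmax(theta):
    """Numerically stable softmax over arm parameters."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()


def run_bandit(means, steps=5000, lr=0.1, gamma=0.01,
               noise_std=1.0, theta0=None, seed=0):
    """Stochastic gradient ascent on E[reward] - (gamma / 2) * ||theta||^2.

    All arguments are hypothetical choices for this sketch:
    means     -- true mean reward of each arm
    lr        -- constant step size
    gamma     -- constant L2 regularization coefficient
    noise_std -- standard deviation of the Gaussian reward noise
    theta0    -- initial parameters (zeros if not given)
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    k = means.size
    theta = np.zeros(k) if theta0 is None else np.array(theta0, dtype=float)

    for _ in range(steps):
        pi = softmax(theta)
        arm = rng.choice(k, p=pi)
        reward = means[arm] + noise_std * rng.standard_normal()

        # Unbiased estimate of the gradient of the expected reward
        # under the softmax policy: reward * (e_arm - pi).
        grad = -reward * pi
        grad[arm] += reward

        # The L2 penalty contributes -gamma * theta to the ascent direction.
        theta += lr * (grad - gamma * theta)

    return theta, softmax(theta)


if __name__ == "__main__":
    theta, probs = run_bandit([0.2, 0.5, 0.9])
    print("parameters:", theta)
    print("policy:", probs)
```

In this sketch the penalty keeps the parameters bounded, so the policy never becomes fully deterministic; passing a `theta0` far from the optimum is one way to experiment with the poor initial guess that the abstract refers to.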

Anita et al. studied this question.

synapsesocial.com/papers/68e7b285b6db64358770d43c https://doi.org/10.1007/978-3-031-78395-1_27
