Convergence of a L2 regularized Policy Gradient Algorithm for the Multi Armed Bandit