In this study, we propose a new method for constructing UCB-type algorithms for stochastic multi-armed bandits based on general convex optimization methods with an inexact oracle. We derive the regret bounds corresponding to the convergence rates of the optimization methods. We propose a new algorithm, Clipped-SGD-UCB, and show, both theoretically and empirically, that in the case of symmetric noise in the reward we can achieve an $O(\log T \sqrt{KT \log T})$ regret bound instead of $O\left(T^{\frac{1}{1+\alpha}} K^{\frac{\alpha}{1+\alpha}}\right)$ for the case when the reward distribution satisfies $\mathbb{E}_{X \in D}[|X|^{1+\alpha}] \leq \sigma^{1+\alpha}$ ($\alpha \in (0, 1]$), i.e., we perform better than the general lower bound for heavy-tailed bandits would suggest. Moreover, the same bound holds even when the reward distribution does not have an expectation, that is, when $\alpha < 0$.
Dorn et al. studied this question.
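To illustrate why symmetric noise helps, the following is a minimal sketch of clipped SGD used as a location estimator for a single arm. It is not the paper's Clipped-SGD-UCB algorithm: the function names, the clipping threshold, the step-size schedule, and the Cauchy reward model are all assumptions chosen for illustration. Each reward $X$ defines the loss $\frac{1}{2}(x - X)^2$, and the gradient $x - X$ is clipped before every step, so the estimate can settle near the center of symmetry even when the rewards have no expectation.

```python
import numpy as np


def clip(value, threshold):
    """Project a scalar onto the interval [-threshold, threshold]."""
    return max(-threshold, min(threshold, value))


def clipped_sgd_location(samples, threshold=1.0, step_scale=2.0, start=0.0):
    """Estimate the center of a symmetric distribution with clipped SGD.

    Hypothetical helper, not taken from the paper. Each sample X defines the
    loss 0.5 * (x - X)**2, whose gradient is x - X. The gradient is clipped
    before the step, so a single extreme sample can move the estimate by at
    most step_scale * threshold / t at iteration t.
    """
    estimate = start
    for t, sample in enumerate(samples, start=1):
        gradient = clip(estimate - sample, threshold)
        estimate -= (step_scale / t) * gradient
    return estimate


rng = np.random.default_rng(0)
true_center = 3.0
# Assumed reward model: standard Cauchy noise, symmetric with no expectation.
rewards = true_center + rng.standard_cauchy(20_000)

estimate = clipped_sgd_location(rewards)
print(f"clipped-SGD estimate: {estimate:.3f}")
print(f"sample mean:          {rewards.mean():.3f}")
```

Under this assumed model the sample mean does not concentrate as more rewards arrive, whereas the clipped iterate does, because the clipped gradient has bounded variance and zero mean at the center of symmetry. A UCB-type rule would then add a confidence width to such an estimate for each arm; the exact width used by the paper is not given in this abstract.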