Los puntos clave no están disponibles para este artículo en este momento.
In stochastic low-rank matrix bandit, the expected reward of an arm is equal to the inner product between its feature matrix and some unknown d₁ by d₂ low-rank parameter matrix ^* with rank r d₁ d₂. While all prior studies assume the payoffs are mixed with sub-Gaussian noises, in this work we loosen this strict assumption and consider the new problem of low-rank matrix bandit with heavy-tailed rewards (LowHTR), where the rewards only have finite (1+) moment for some (0, 1]. By utilizing the truncation on observed payoffs and the dynamic exploration, we propose a novel algorithm called LOTUS attaining the regret bound of order O (d³2r¹2T¹1+/Dₑₑ) without knowing T, which matches the state-of-the-art regret bound under sub-Gaussian noises~lu2021low, kang2022efficient with = 1. Moreover, we establish a lower bound of the order (d^1+ r^1+ T¹1+) = (T¹1+) for LowHTR, which indicates our LOTUS is nearly optimal in the order of T. In addition, we improve LOTUS so that it does not require knowledge of the rank r with O (dr³2T¹+1+2) regret bound, and it is efficient under the high-dimensional scenario. We also conduct simulations to demonstrate the practical superiority of our algorithm.
Kang et al. (Fri,) studied this question.