What type of study is this?

September 10, 2025Open Access

Adaptive Smooth Nonstationary Bandits

Key Points

Optimal dynamic regret rates can be achieved adaptively, even without knowledge of model parameters.
Minimax dynamic regret is established for general values of K, beta, and lambda in nonstationary bandits.
Existing bounds for upper limits were only known for restricted regimes, indicating significant contributions in this paper.
New gap-dependent rates improve performance in nonstationary settings with a safe arm, challenging prior pessimistic assumptions.

Abstract

. We study a \ (K\) -armed nonstationary bandit model where rewards change smoothly, as captured by Hölder class assumptions on rewards as functions of time. Such smooth changes are parametrized by a Hölder exponent \ (\) and coefficient \ (\). While various subcases of this general model have been studied in isolation, we first establish the minimax dynamic regret rate generally for all \ (K, , \). Next, we show that this optimal dynamic regret can be attained adaptively, without knowledge of \ (, \). To contrast, even with parameter knowledge, upper bounds were only previously known for limited regimes \ (1\) and \ (=2\) A. Slivkins, J. Mach. Learn. Res. , 15 (2014), pp. 2533–2568, R. Krishnamurthy and A. Gopalan, On Slowly-Varying Non-Stationary Bandits, preprint, arXiv: 2110. 12916, 2021, A. G. Manegueu, A. Carpentier, and Y. Yu, Generalized, Non-stationary Bandits, preprint, arXiv: 2102. 00725, 2021, S. Jia, Q. Xie, N. Kallus, and P. Frazier, Smooth non-stationary bandits, in International Conference on Machine Learning, JMLR. org, 2023, pp. 14930–14944. Thus, our work resolves open questions raised by disparate threads of the literature. We also study the problem of attaining faster gap-dependent regret rates in nonstationary bandits. While such rates are long known to be impossible in general A. Garivier and E. Moulines, On upper-confidence bound policies for switching bandit problems, in Proceedings of the 22nd International Conference on Algorithmic Learning Theory, Springer, 2011, pp. 174–188, we show that environments admitting a safe arm J. Suk and S. Kpotufe, Tracking most significant arm switches in bandits, in Conference on Learning Theory, 2022 allow for much faster rates than the worst-case scaling with \ (T\). While previous works in this direction focused on attaining the usual logarithmic regret bounds, as summed over stationary periods, our new gap-dependent rates reveal new optimistic regimes of nonstationarity where even the logarithmic bounds are pessimistic. We show that our new gap-dependent rate is tight and that its achievability (i. e. , as made possible by a safe arm) has a surprisingly simple and clean characterization within the smooth Hölder class model. Keywordsmultiarmed banditsnonstationarynonparametricadaptivityinstance-dependent regretMSC codes68T0562G99

Read Full Paperexternally

Bookmark

View Full Paper

Cite This Study

Joe Suk (Tue,) studied this question.

synapsesocial.com/papers/68c1c23554b1d3bfb60efbbb https://doi.org/https://doi.org/10.1137/24m167651x

Bookmark

View Full Paper