Key points are not available for this paper at this time.
Most contextual bandit algorithms minimize regret against the best fixed policy, a questionable benchmark for non-stationary environments that are ubiquitous in applications. In this work, we develop several efficient contextual bandit algorithms for non-stationary environments by equipping existing methods for i. i. d. problems with sophisticated statistical tests so as to dynamically adapt to a change in distribution. We analyze various standard notions of regret suited to non-stationary environments for these algorithms, including interval regret, switching regret, and dynamic regret. When competing with the best policy at each time, one of our algorithms achieves regret O (ST) if there are T rounds with S stationary periods, or more generally O (Δ^1/3T^2/3) where Δ is some non-stationarity measure. These results almost match the optimal guarantees achieved by an inefficient baseline that is a variant of the classic Exp4 algorithm. The dynamic regret result is also the first one for efficient and fully adversarial contextual bandit. Furthermore, while the results above require tuning a parameter based on the unknown quantity S or Δ, we also develop a parameter free algorithm achieving regret \S^{1/4T^3/4, Δ^1/5T^4/5\}. This improves and generalizes the best existing result Δ^0. 18T^0. 82 by Karnin and Anava (2016) which only holds for the two-armed bandit problem.
Luo et al. (Sat,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: