Key points are not available for this paper at this time.
The Multi-Armed Bandit (MAB) problem is a well-studied topic within stationary environments, where the reward distributions remain consistent over time. Nevertheless, many real-world applications often fall within non-stationary contexts, where the rewards from each arm can evolve. In light of this, our research focuses on examining and contrasting the effectiveness of two leading algorithms tailored for these shifting environments: the Sliding Window Upper Confidence Bound (SW-UCB) and the Discount Factor UCB (DF-UCB). By harnessing both simulated and real-world datasets, our evaluation encompasses adaptability, computational efficiency, and the potential for regret minimization. Our findings reveal that the SW-UCB is adept at swiftly adjusting to sudden shifts, whereas the DF-UCB emerges as the more resource-efficient option amidst gradual transitions. Notably, when pitted against conventional UCB algorithms within non-stationary contexts, both contenders exhibit substantial advancements. Such insights bear significant relevance to fields like online advertising, healthcare, and finance, where the capacity to nimbly adapt to dynamic environments is paramount.
Haochen Liu (Thu,) studied this question.