The exploration-exploitation trade-off is a central challenge in Reinforcement Learning (RL), and the Multi-Armed Bandit (MAB) problem is its foundational model, providing the classical paradigm for studying exploration and exploitation strategies. With the development of big data and deep learning, applications of RL in online learning, recommender systems, and other fields have become increasingly complex, giving rise to model variants such as multi-objective and stochastic adversarial bandits. This paper reviews the limitations of the classical MAB algorithms ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling. It then explores potential improvements for non-standard reward settings, including rewards that vary over time and rewards that arrive with a delay. It also discusses a further limitation of traditional MAB, namely its inability to use contextual information. Two application-driven variants tailored to real-world scenarios, multi-objective MAB and adversarial MAB, are the main focus of the investigation. Finally, the cross-disciplinary characteristics of these variant algorithms are examined to provide algorithmic references for future research.
Zhiping Zheng (Thu,) studied this question.