This paper presents an in-depth analysis of the Multi-Armed Bandit (MAB) problem, tracing its evolution from its origins in the gambling domain of the 1940s to its current prominence in machine learning and artificial intelligence. The analysis begins with a historical overview, noting key developments like Herbert Robbins' probabilistic framework and the expansion of the problem into strategic decision-making in the 1970s. The emergence of algorithms like the Upper Confidence Bound (UCB) and Thompson Sampling in the late 20th century is highlighted, demonstrating the MAB problem's transition to practical applications. The integration of MAB algorithms with machine learning, particularly in the era of reinforcement learning, is explored, emphasizing their application in various domains such as online advertising, financial market trading, and clinical trials. The paper discusses the critical role of decision theory and probabilistic models in MAB problems, focusing on the balance between exploration and exploitation strategies. Recent advancements in Contextual Bandits, non-stationary reward distributions, and Multi-agent Bandits are examined, showcasing the ongoing evolution and adaptability of MAB problems.
Jiazhen Wu (Fri,) studied this question.