July 10, 2025

A Study of Exploration-Exploitation Strategies in Unconventional Situations

Key Points

Improved exploration-exploitation strategies can perform better in unconventional settings, such as those with time-varying or delayed rewards, which supports applications in those settings.
The study identifies limitations in classical algorithms like ε-greedy and Upper Confidence Bound, which struggle with time-varying rewards.
The analysis covers the role of contextual information in multi-armed bandit systems, which traditional formulations cannot use, offering deeper insight for real-world applications.
The findings may serve as an algorithmic reference for future research and highlight the need for new approaches in reinforcement learning.

Abstract

The exploration-exploitation problem is a central challenge in Reinforcement Learning (RL), and the Multi-Armed Bandit (MAB) problem serves as its foundation, providing a classical paradigm for exploration and exploitation strategies. With the development of big data and deep learning, applications of RL models in online learning, recommender systems, and other fields have become increasingly complex, giving rise to model variants such as multi-objective optimization and stochastic adversarial settings. This paper reviews the limitations of classical algorithms such as ε-greedy, Upper Confidence Bound (UCB), and Thompson sampling in multi-armed bandit systems. It explores potential improvements for unconventional reward settings, including the case where the reward signal is time-varying and arrives with some delay. It also examines a limitation of traditional MAB, namely the inability to use contextual information. In addition, it investigates two major application-driven MAB variants tailored to real-world scenarios: multi-objective MAB and adversarial MAB. The cross-disciplinary characteristics of these variant algorithms are also examined to provide an algorithmic reference for future research.
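As a point of reference for the classical algorithms named above, the following is a minimal sketch of ε-greedy and UCB1 on a stationary Bernoulli bandit. It is an illustration under stated assumptions, not the paper's implementation: the function names (`epsilon_greedy`, `ucb1`, `run`), the arm probabilities, the horizon, and the choice of the UCB1 variant with bonus sqrt(2 ln t / n) are all assumptions made here. Thompson sampling and the time-varying or delayed reward settings are not covered.

```python
import math
import random


def epsilon_greedy(counts, means, t, rng, epsilon=0.1):
    """Explore a uniformly random arm with probability epsilon,
    otherwise exploit the arm with the highest empirical mean."""
    if rng.random() < epsilon:
        return rng.randrange(len(counts))
    return max(range(len(counts)), key=lambda a: means[a])


def ucb1(counts, means, t, rng):
    """Play every arm once, then pick the arm that maximizes
    empirical mean + sqrt(2 ln t / n_a) (the UCB1 index)."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def run(policy, probs, horizon=5000, seed=0):
    """Simulate a stationary Bernoulli bandit (hypothetical setup).

    probs[a] is the success probability of arm a. Returns the number
    of pulls per arm and the total reward collected.
    """
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k      # pulls per arm
    means = [0.0] * k     # running empirical mean reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        arm = policy(counts, means, t, rng)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        # Incremental mean update: no reward history is stored.
        means[arm] += (reward - means[arm]) / counts[arm]
        total += reward
    return counts, total


if __name__ == "__main__":
    probs = [0.2, 0.5, 0.8]  # assumed arm probabilities, for illustration only
    for name, policy in [("epsilon-greedy", epsilon_greedy), ("UCB1", ucb1)]:
        counts, total = run(policy, probs)
        print(name, counts, total)
```

Both policies estimate each arm's value with a running average over all past rewards, so old observations never lose weight. That design is reasonable when reward probabilities are fixed, and it is one way to see why such estimators can lag when rewards drift over time.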
