This paper presents a novel approach for dynamically tuning the exploration parameter of Multi-Armed Bandit (MAB) algorithms with a Deep Q-Network (DQN), with the goal of improving performance in both static and dynamic environments. Traditional MAB algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) rely on fixed exploration parameters and assume stationary reward distributions, which limits their effectiveness in real-world applications where reward distributions change over time. To address this problem, the paper proposes a learning-based method in which a DQN agent observes the state of the MAB environment and selects an appropriate exploration parameter from a predefined set. Experimental results show that the DQN-enhanced UCB algorithm consistently outperforms its traditional counterpart in both static and dynamic environments, achieving lower cumulative regret. In contrast, DQN-tuned TS yields moderate improvements in dynamic settings but exhibits instability in static environments. These findings highlight the potential of integrating neural-network-based learning with classical decision-making strategies to enable adaptive exploration in non-stationary environments, offering useful insights for recommender systems and other sequential decision-making tasks.
Yuhan Shi (Wed,) studied this question.
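
To make the role of the exploration parameter concrete, the following is a minimal sketch of UCB on Bernoulli arms in which the parameter c is supplied at every step by a callback. It is not the paper's implementation: the index form `mean + c * sqrt(ln t / n)` is the common textbook variant and is assumed here, the candidate set `C_CANDIDATES` and all function names are hypothetical, and the DQN agent is replaced by a plain function `choose_c` that a learned policy could stand in for. Regret is computed as pseudo-regret against the best arm of a static environment.

```python
import math
import random

# Hypothetical candidate set; the paper's actual predefined set is not given here.
C_CANDIDATES = [0.1, 0.5, 1.0, 2.0]


def ucb_select(counts, means, t, c):
    """Return the arm with the largest index mean + c * sqrt(ln t / n).

    Arms that have never been pulled are tried first.
    """
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + c * math.sqrt(math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb(probs, horizon, choose_c, seed=0):
    """Run UCB on static Bernoulli arms and return cumulative pseudo-regret.

    choose_c(t, counts, means) returns the exploration parameter for step t.
    In the setting described above, a DQN agent would play this role
    (an assumption of this sketch, not code from the paper).
    """
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    best = max(probs)
    regret = 0.0
    for t in range(1, horizon + 1):
        c = choose_c(t, counts, means)
        arm = ucb_select(counts, means, t, c)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        regret += best - probs[arm]
    return regret


def fixed_c(value):
    """Baseline policy: always use the same exploration parameter."""
    return lambda t, counts, means: value


if __name__ == "__main__":
    for c in C_CANDIDATES:
        print(c, run_ucb([0.2, 0.5, 0.7], horizon=2000, choose_c=fixed_c(c)))
```

Running the script prints the cumulative regret for each fixed value of c, which is the baseline that an adaptive selector would be compared against. Swapping `fixed_c` for a state-dependent function is the only change needed to experiment with adaptive tuning in this sketch.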