September 10, 2025 · Open Access

Neural Network-Based Parameter Tuning for Multi-Armed Bandit Algorithms

Key Points

DQN-tuned multi-armed bandit algorithms demonstrate improved performance in both static and dynamic environments.
Experimental results indicate that the DQN-enhanced Upper Confidence Bound (UCB) algorithm achieves significantly lower cumulative regret than its traditional, fixed-parameter counterpart.
Traditional multi-armed bandit methods often exhibit limitations due to fixed exploration parameters and stationary reward assumptions.
Integrating neural networks with classical decision-making strategies allows exploration to adapt effectively to variable reward distributions.

Abstract

This paper presents a novel approach for dynamically tuning the exploration parameter in Multi-Armed Bandit (MAB) algorithms using Deep Q-Networks (DQN), with the aim of improving performance in both static and dynamic environments. Traditional MAB algorithms such as Upper Confidence Bound (UCB) and Thompson Sampling (TS) rely on fixed exploration parameters and assume stationary reward distributions, which limits their effectiveness in real-world applications where reward distributions can change over time. To address this problem, the paper proposes a learning-based method in which a DQN agent observes the state of the MAB environment and selects an appropriate exploration parameter from a predefined set. Experimental results show that the DQN-enhanced UCB algorithm consistently outperforms its traditional counterpart in both static and dynamic environments, achieving lower cumulative regret. In contrast, DQN-tuned TS shows moderate improvement in dynamic settings but exhibits instability in static environments. These findings highlight the potential of integrating neural network-based learning with classical decision-making strategies to enable adaptive exploration in non-stationary environments, offering valuable insights for recommender systems and other sequential decision-making tasks.
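
To make the terms "exploration parameter" and "cumulative regret" concrete, the following is a minimal sketch of the fixed-parameter UCB baseline on a stationary Bernoulli bandit. It is not the paper's implementation: the function names, the candidate set `C_VALUES`, and the arm probabilities are all hypothetical, and the DQN agent that would pick `c` from the set at each step is deliberately left out.

```python
import math
import random

# Hypothetical predefined set of exploration parameters (an assumption,
# not taken from the paper). A tuning agent would choose among these.
C_VALUES = [0.1, 0.5, 1.0, 2.0]


def ucb_select(counts, means, t, c):
    """Return the arm with the highest UCB index; c scales the exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # play every arm once before using the index
    scores = [m + c * math.sqrt(math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_ucb(probs, horizon, c, seed=0):
    """Run UCB with a fixed c on a Bernoulli bandit; return cumulative (pseudo-)regret."""
    rng = random.Random(seed)
    counts = [0] * len(probs)
    means = [0.0] * len(probs)
    best = max(probs)
    regret = 0.0
    for t in range(1, horizon + 1):
        arm = ucb_select(counts, means, t, c)
        reward = 1.0 if rng.random() < probs[arm] else 0.0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        regret += best - probs[arm]  # expected loss versus the best arm
    return regret


if __name__ == "__main__":
    for c in C_VALUES:
        print(c, run_ucb([0.2, 0.5, 0.8], horizon=2000, c=c))
```

In this sketch `c` stays fixed for the whole run, which is the limitation the abstract describes. Under the approach the abstract outlines, a learned agent would instead choose `c` from the predefined set based on the observed state of the bandit; how that state is encoded is not specified here.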
