The Multi-Armed Bandit (MAB) problem is central to reinforcement learning because it captures the trade-off between exploration and exploitation. Traditional MAB algorithms, however, often struggle in non-stationary environments where the correlations between arms evolve over time. This paper introduces the Correlation-Aware Collaborative Adaptive Window Algorithm (Adaptive UCB), which addresses these challenges by combining two techniques: Dynamic Window Recalibration (DWR) and Hierarchical Correlation-Aware Exploration (HCAE). DWR adjusts the size of the historical data window based on real-time covariance analysis, allowing the algorithm to adapt to both abrupt and gradual changes in the environment. HCAE improves arm selection by clustering arms and applying the Upper Confidence Bound (UCB) rule at the group level, which guides exploration and reduces redundant sampling. Experiments show that Adaptive UCB outperforms Standard UCB, Sliding Window UCB, and Restart UCB, with the largest advantage in volatile environments and when arms are highly correlated. Adaptive UCB achieves a cumulative regret of only 18.35% of that of Standard UCB and 44.66% of that of Sliding Window UCB. It also increases the average reward by 5.6% relative to Standard UCB and 1.52% relative to Sliding Window UCB, indicating that the algorithm is efficient in dynamic conditions.
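To make the windowing idea concrete, the following is a minimal sketch of the Sliding Window UCB baseline named above, not the paper's Adaptive UCB. The class name `SlidingWindowUCB`, the exploration weight `c`, and the `resize` hook are illustrative assumptions; the abstract does not specify how DWR chooses the window size or how HCAE clusters arms, so neither is implemented here.

```python
import math
from collections import deque


class SlidingWindowUCB:
    """Sliding-window UCB sketch: statistics use only the most recent pulls."""

    def __init__(self, n_arms, window, c=2.0):
        self.n_arms = n_arms
        self.c = c  # exploration weight (assumed value, not from the paper)
        self.history = deque(maxlen=window)  # recent (arm, reward) pairs

    def resize(self, window):
        # Hypothetical hook: a DWR-style rule could call this to shrink or
        # grow the window. A bounded deque keeps the most recent entries.
        self.history = deque(self.history, maxlen=window)

    def select_arm(self):
        counts = [0] * self.n_arms
        sums = [0.0] * self.n_arms
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Any arm with no observation inside the window is played first.
        for arm in range(self.n_arms):
            if counts[arm] == 0:
                return arm
        t = len(self.history)  # number of pulls currently in the window

        def index(arm):
            mean = sums[arm] / counts[arm]
            bonus = math.sqrt(self.c * math.log(t) / counts[arm])
            return mean + bonus

        return max(range(self.n_arms), key=index)

    def update(self, arm, reward):
        self.history.append((arm, reward))
```

A caller would alternate `select_arm()` and `update(arm, reward)`. Shrinking the window through `resize` discards older observations, so an arm that falls out of the window is re-explored, which is the behavior an adaptive window relies on after a change in the environment.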
Y. H. Xu studied this question.