The Explore-Then-Commit (ETC) algorithm explores every arm extensively before committing to one, which creates a paradox in practice: the approach is effective when the time horizon is indeterminate but shows limitations when the horizon is fixed in advance. To address this, the study integrates Reinforcement Learning (RL) into ETC. The bootstrap method is used to generate batches of data, and dynamic programming is applied to these batches to estimate the probability distribution of each arm at different time points. These estimates are then combined with the ε-greedy strategy, giving a hybrid method intended to improve on the traditional ETC algorithm. By rebalancing exploration and exploitation, the method aims to improve decision-making in bandit problems, particularly when the time horizon is known. The proposed modification extends the applicability of ETC and offers a new RL-based perspective on adaptive decision-making.
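For readers unfamiliar with the baseline, the following is a minimal sketch of the classical ETC procedure that the abstract sets out to improve. It is not the paper's hybrid method: the bootstrap batching, the dynamic-programming estimates, and the ε-greedy combination are not specified in enough detail here to reproduce. The function name `explore_then_commit`, the Bernoulli reward model, and the parameters `arm_means`, `horizon`, `m`, and `seed` are all illustrative assumptions.

```python
import random


def explore_then_commit(arm_means, horizon, m, seed=0):
    """Classical ETC on simulated Bernoulli arms (illustrative assumption).

    arm_means: true success probability of each arm (hidden from the learner).
    horizon:   total number of pulls available.
    m:         number of exploration pulls per arm before committing.
    Returns (committed_arm, total_reward, list_of_pulled_arms).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k      # pulls per arm
    sums = [0.0] * k      # cumulative reward per arm
    pulls = []            # history of chosen arms

    def pull(arm):
        # Simulated Bernoulli reward; a real environment would replace this.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        pulls.append(arm)
        return reward

    total = 0.0

    # Exploration phase: round-robin over the arms, m pulls each
    # (cut short if the horizon is smaller than m * k).
    for t in range(min(m * k, horizon)):
        total += pull(t % k)

    # Commit phase: pick the arm with the highest empirical mean
    # and play it for all remaining rounds.
    def empirical_mean(arm):
        return sums[arm] / counts[arm] if counts[arm] else float("-inf")

    best = max(range(k), key=empirical_mean)
    for _ in range(horizon - len(pulls)):
        total += pull(best)

    return best, total, pulls


if __name__ == "__main__":
    best, total, pulls = explore_then_commit([0.1, 0.9], horizon=200, m=20)
    print("committed arm:", best, "total reward:", total)
```

The sketch makes the trade-off in the abstract concrete: `m` is fixed before any data is seen, so with a known `horizon` a poorly chosen `m` either wastes rounds on exploration or commits on too little evidence. That fixed split is what the proposed hybrid aims to recalibrate.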
Kaiyu Yan (Fri,) studied this question.