Bandit problems provide a mathematical framework for decision-making under uncertainty and isolate the exploration-exploitation trade-off that also arises in Markov decision processes and reinforcement learning. This thesis studies extensions of Thompson sampling, a randomized exploration algorithm for bandit problems. We first review several standard bandit formulations, including K-armed, stochastic linear, contextual linear, and general function approximation bandits, along with the frequentist and Bayesian notions of regret and measures of function class complexity. We then examine a recent analysis of Thompson sampling in stochastic linear bandits that relies on the geometry of the action set to inject sufficient optimism to achieve minimax optimal regret. Motivated by this geometric approach, we attempt to extend the analysis to the general function approximation setting. The extension falls short, and we give a counterexample showing how restrictive the required strong convexity and smoothness assumptions are. After this negative result, we propose a Mirror-Descent Thompson Sampling algorithm for contextual linear bandits, which replaces greedy action selection under the sampled reward parameter with a policy distribution update based on KL-regularized mirror descent. This yields a more stable policy evolution and allows application-specific regularization while preserving near-minimax regret up to logarithmic factors, with an added mirror descent penalty term.
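To make the policy update concrete, the following is a minimal sketch of a single KL-regularized mirror descent step over a finite action set. It is not the thesis's algorithm: the function names, the step size `eta`, the toy features, and the fixed `theta_sample` (standing in for a draw from the Thompson sampling posterior) are all assumptions made for illustration. The sketch uses only the standard closed form of the KL-regularized step on the simplex, in which each action's probability is multiplied by the exponential of its scaled sampled reward and the result is renormalized.

```python
import math


def linear_rewards(features, theta_sample):
    # Sampled reward of each action under a linear model: <x_a, theta_sample>.
    return [sum(x_i * t_i for x_i, t_i in zip(x, theta_sample)) for x in features]


def kl_mirror_descent_step(policy, rewards, eta):
    # Closed-form solution of
    #   argmax_p  <p, rewards> - (1 / eta) * KL(p || policy)
    # over the probability simplex: p(a) is proportional to
    # policy(a) * exp(eta * rewards(a)).
    logits = [
        (math.log(p) if p > 0 else float("-inf")) + eta * r
        for p, r in zip(policy, rewards)
    ]
    m = max(logits)  # subtract the max before exponentiating, for stability
    weights = [math.exp(l - m) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical toy instance: three actions with two-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
theta_sample = [0.2, 0.8]  # assumed stand-in for a posterior sample
policy = [1.0 / 3, 1.0 / 3, 1.0 / 3]

new_policy = kl_mirror_descent_step(
    policy, linear_rewards(features, theta_sample), eta=1.0
)
```

A small `eta` keeps the new policy close to the previous one, while a large `eta` approaches the greedy choice under the sampled parameter, which is one way to read the stability claim above.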
Max Van Fleet (Tue,) studied this question.