What question did this study set out to answer?

The aim is to study extensions of Thompson sampling and propose a novel algorithm to improve decision-making in contextual linear bandits.

June 11, 2026 · Open Access

Extensions of Thompson Sampling

Key Points

Reviewed standard bandit formulations including K-armed and stochastic linear bandits.
Examined a recent analysis of Thompson sampling in stochastic linear bandits that uses the geometry of the action set to achieve minimax optimal regret.
Developed a Mirror-Descent Thompson Sampling algorithm for contextual linear bandits utilizing KL-regularized updates.
Constructed a counterexample showing how restrictive the strong convexity and smoothness assumptions are when extending the geometric analysis to general function approximation.
Showed that the new algorithm preserves near-minimax regret up to logarithmic factors, with an added mirror descent penalty term.
Provided a more stable policy evolution and allowed for application-specific regularization.

Abstract

Bandit problems provide a mathematical framework for decision-making under uncertainty, designed to address the exploration-exploitation trade-off, which arises broadly in Markov Decision Processes and Reinforcement Learning. This thesis studies extensions of Thompson sampling, a randomized exploration algorithm for optimally playing bandit problems. We first review several standard bandit formulations, including K-armed, stochastic linear, contextual linear, and general function approximation bandits, along with the frequentist and Bayesian notions of regret and measures of function class complexity. We then examine a recent analysis of Thompson sampling in stochastic linear bandits that relies on the geometry of the action set to inject sufficient optimism to achieve minimax optimal regret. Motivated by this geometric approach, we attempt to extend the analysis to the general function approximation setting. The extension does not go through, and we give a counterexample showing how restrictive the strong convexity and smoothness assumptions are. Following this negative result, we propose a Mirror-Descent Thompson Sampling algorithm for contextual linear bandits, which replaces greedy action selection with respect to the sampled reward parameter with a policy distribution update based on KL-regularized mirror descent. This yields a more stable policy evolution and allows application-specific regularization, while preserving near-minimax regret up to logarithmic factors with an added mirror descent penalty term.
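
To make the last idea concrete, the following is a minimal sketch of how a sampled reward parameter could drive a KL-regularized mirror descent step on a policy instead of a greedy argmax. It is not the algorithm from the thesis, whose exact update, step sizes, and regularizer are not given on this page. The Gaussian prior and noise model, the small table of discrete contexts, the fixed learning rate `eta`, and the names `kl_mirror_descent_step` and `run_md_ts` are all assumptions made for illustration.

```python
import numpy as np


def kl_mirror_descent_step(log_policy, est_rewards, eta):
    """One KL-regularized mirror descent step on the simplex, in log space.

    Maximizing <p, est_rewards> - (1/eta) * KL(p || policy) over distributions p
    has the closed form p(a) proportional to policy(a) * exp(eta * est_rewards[a]).
    """
    logits = log_policy + eta * est_rewards
    m = logits.max()  # subtract the max for numerical stability
    return logits - (m + np.log(np.exp(logits - m).sum()))


def run_md_ts(theta_star, n_rounds=500, n_contexts=4, n_arms=5,
              eta=1.0, noise_sd=0.1, prior_var=1.0, seed=0):
    """Illustrative loop: Thompson sample, then mirror-descent policy update."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    # Assumption: a finite set of contexts, each with fixed per-arm features.
    features = rng.normal(size=(n_contexts, n_arms, d))
    # Gaussian posterior over the reward parameter, stored as precision and b.
    precision = np.eye(d) / prior_var
    b = np.zeros(d)
    # One policy per context, initialized uniform and kept as log-probabilities.
    log_policy = np.full((n_contexts, n_arms), -np.log(n_arms))
    total_reward = 0.0
    for _ in range(n_rounds):
        c = rng.integers(n_contexts)
        X = features[c]
        # Sample theta from N(mean, precision^{-1}) via a Cholesky factor.
        mean = np.linalg.solve(precision, b)
        chol = np.linalg.cholesky(precision)
        theta_sample = mean + np.linalg.solve(chol.T, rng.standard_normal(d))
        # Policy update replaces the greedy argmax over X @ theta_sample.
        log_policy[c] = kl_mirror_descent_step(log_policy[c], X @ theta_sample, eta)
        a = rng.choice(n_arms, p=np.exp(log_policy[c]))
        reward = X[a] @ theta_star + noise_sd * rng.standard_normal()
        # Bayesian linear regression update with known noise variance.
        precision += np.outer(X[a], X[a]) / noise_sd**2
        b += X[a] * reward / noise_sd**2
        total_reward += reward
    return np.exp(log_policy), total_reward / n_rounds


if __name__ == "__main__":
    policy, avg_reward = run_md_ts(np.array([1.0, -0.5, 0.25]))
    print(policy.round(3))
    print(avg_reward)
```

In this sketch a smaller `eta` keeps each new policy closer to the previous one in KL divergence, which is one way to read the "more stable policy evolution" described above. As `eta` grows, a single update approaches greedy selection on the sampled parameter.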
