February 7, 2022

Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes

Abstract

Dynamic Personalized Decision Making Beyond the Super-Extrapolatable and Super-Local Cases

Contextual bandit problems model the inherent trade-off between exploration and exploitation in personalized decision making in marketing, healthcare, revenue management, and more. Specifically, the trade-off is characterized by the optimal growth rate of the regret. Intuitively, the optimal rate should depend on how complex the underlying supervised learning problem is, namely, how much observing the reward in one context can tell us about mean rewards in another. To formalize this intuitive relationship, Hu, Kallus, and Mao study in “Smooth Contextual Bandits: Bridging the Parametric and Nondifferentiable Regret Regimes” a nonparametric contextual bandit problem in which mean reward functions are β-times differentiable (more generally, Hölder β-smooth). This interpolates between two extremes previously studied in isolation: nondifferentiable bandits (β ≤ 1), where running separate noncontextual bandits in different context regions achieves rate-optimal regret, and parametric-response bandits (β = ∞), where rate-optimal regret can be achieved with minimal or no exploration because of infinite extrapolatability across contexts. The authors develop a rate-optimal algorithm that operates neither fully locally nor fully globally, revealing the optimal regret rate in this in-between smooth setting and shedding light on the crucial interplay of functional complexity and regret in dynamic personalized decision making.
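To make the "fully local" extreme mentioned in the abstract concrete, the following is a minimal sketch of running separate noncontextual bandits in different context regions. It is not the authors' algorithm. Everything in it is an assumption made for illustration: the one-dimensional context on [0, 1], the equal-width bins, the use of UCB1 inside each bin, the toy mean reward functions, and the names `BinnedUCB` and `simulate`.

```python
import math
import random


class BinnedUCB:
    """One independent UCB1 bandit per context bin (hypothetical illustration)."""

    def __init__(self, n_bins, n_arms):
        self.n_bins = n_bins
        self.n_arms = n_arms
        # Per-bin, per-arm pull counts and reward sums; bins share no data.
        self.counts = [[0] * n_arms for _ in range(n_bins)]
        self.sums = [[0.0] * n_arms for _ in range(n_bins)]

    def bin_of(self, x):
        # Map a context x in [0, 1] to its bin index.
        return min(int(x * self.n_bins), self.n_bins - 1)

    def select(self, x):
        b = self.bin_of(x)
        counts = self.counts[b]
        # Pull each arm once in this bin before using confidence bounds.
        for a in range(self.n_arms):
            if counts[a] == 0:
                return a
        total = sum(counts)

        def ucb(a):
            mean = self.sums[b][a] / counts[a]
            return mean + math.sqrt(2.0 * math.log(total) / counts[a])

        return max(range(self.n_arms), key=ucb)

    def update(self, x, arm, reward):
        b = self.bin_of(x)
        self.counts[b][arm] += 1
        self.sums[b][arm] += reward


def simulate(horizon=5000, n_bins=8, seed=0):
    rng = random.Random(seed)
    # Assumed toy mean rewards: arm 1 is better exactly when x > 0.5.
    means = [lambda x: 0.5, lambda x: x]
    policy = BinnedUCB(n_bins, n_arms=2)
    regret = 0.0
    for _ in range(horizon):
        x = rng.random()
        arm = policy.select(x)
        reward = means[arm](x) + rng.gauss(0.0, 0.1)
        policy.update(x, arm, reward)
        # Regret is the gap between the best mean reward at x and the chosen arm's.
        regret += max(m(x) for m in means) - means[arm](x)
    return policy, regret
```

Calling `simulate()` returns the fitted policy and the cumulative regret of one run. Because each bin learns only from its own observations, nothing learned in one region is extrapolated to another, which is the behavior the abstract contrasts with the parametric extreme.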
