May 6, 2022

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Abstract

We present an algorithm based on posterior sampling (also known as Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of [Formula: see text] for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average-reward policy over the time horizon T. This result closely matches the known lower bound of [Formula: see text]. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.

Funding: This work was supported in part by an NSF CAREER award CMMI 1846792 awarded to author S. Agrawal.
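To make the posterior sampling idea in the abstract concrete, the following is a minimal sketch of one generic ingredient: drawing a plausible transition model from Dirichlet posteriors over next-state probabilities. It is not the paper's algorithm, whose details are not given in the abstract. The function name `sample_transition_model`, the `counts` array layout, and the symmetric `prior` parameter are all assumptions made for illustration.

```python
import numpy as np


def sample_transition_model(counts, rng, prior=1.0):
    """Draw one transition model from independent Dirichlet posteriors.

    counts[s, a, s2] holds the number of observed transitions from state s
    to state s2 under action a. `prior` is an assumed symmetric Dirichlet
    prior parameter (hypothetical choice, not taken from the paper).
    """
    num_states, num_actions, _ = counts.shape
    model = np.empty((num_states, num_actions, num_states))
    for s in range(num_states):
        for a in range(num_actions):
            # Posterior for each (s, a) pair is Dirichlet(counts + prior).
            model[s, a] = rng.dirichlet(counts[s, a] + prior)
    return model


rng = np.random.default_rng(0)
counts = np.zeros((3, 2, 3))   # 3 states, 2 actions, no data yet
counts[0, 1, 2] = 50           # pretend (s=0, a=1) led to s'=2 fifty times
model = sample_transition_model(counts, rng)
```

Each row `model[s, a]` is a probability vector over next states. Rows with many observations concentrate near the empirical frequencies, while rows with no data remain widely spread, which is what drives exploration in posterior sampling methods.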
