Provably Efficient Reinforcement Learning for Infinite-Horizon Average-Reward Linear MDPs

Abstract

We resolve the open problem of designing a computationally efficient algorithm for infinite-horizon average-reward linear Markov Decision Processes (MDPs) with $\widetilde{O}(\sqrt{T})$ regret. Previous approaches with $\widetilde{O}(\sqrt{T})$ regret either suffer from computational inefficiency or require strong assumptions on dynamics, such as ergodicity. In this paper, we approximate the average-reward setting by the discounted setting and show that running an optimistic value iteration-based algorithm for learning the discounted setting achieves $\widetilde{O}(\sqrt{T})$ regret when the discounting factor $\gamma$ is tuned appropriately. The challenge in the approximation approach is to get a regret bound with a sharp dependency on the effective horizon $1/(1-\gamma)$. We use a computationally efficient clipping operator that constrains the span of the optimistic state value function estimate to achieve a sharp regret bound in terms of the effective horizon, which leads to $\widetilde{O}(\sqrt{T})$ regret.
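
To make the span-constraining idea in the abstract concrete, here is a minimal sketch of one natural clipping rule: cap every state value at the smallest value plus a span budget, so that the maximum minus the minimum never exceeds that budget. The function names `clip_span` and `span`, the parameter `max_span`, and the toy value list are assumptions made for illustration; the abstract does not spell out the operator, so this should not be read as the paper's exact definition.

```python
def span(values):
    """Span of a value function: largest value minus smallest value."""
    return max(values) - min(values)


def clip_span(values, max_span):
    """Cap each value at min(values) + max_span (hypothetical clipping rule).

    Values are only ever lowered, and the minimum is left unchanged,
    so the result has span at most max_span.
    """
    if max_span < 0:
        raise ValueError("max_span must be non-negative")
    floor = min(values)
    return [min(v, floor + max_span) for v in values]


# Toy optimistic value estimates over four states (made-up numbers).
optimistic_values = [0.0, 1.5, 4.0, 9.0]
clipped = clip_span(optimistic_values, max_span=3.0)

print(clipped)        # [0.0, 1.5, 3.0, 3.0]
print(span(clipped))  # 3.0
```

The rule costs one pass over the values, which is in keeping with the abstract's emphasis on a computationally efficient operator. How the span budget should relate to the discount factor is not stated in the abstract and is left open here.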

Cite This Study

Hong et al. studied this question.

synapsesocial.com/papers/68e68d03b6db643587614ebf https://doi.org/10.48550/arxiv.2405.15050