What question did this study set out to answer?

January 31, 2022 · Open Access

Learning Infinite-Horizon Average-Reward Markov Decision Processes with Constraints

Key Points

This research aims to minimize regret in infinite-horizon average-reward Markov Decision Processes while adhering to cost constraints.
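Developed a policy optimization algorithm with a carefully designed action-value estimator and bonus term.
Showed that, for ergodic MDPs, this algorithm achieves $\tilde{O}(T^{1/2})$ regret and constant constraint violation, where $T$ is the total number of time steps.
Created a second algorithm for weakly communicating MDPs via a finite-horizon approximation, achieving $\tilde{O}(T^{2/3})$ regret and constraint violation, improvable to $\tilde{O}(T^{1/2})$ through a simple modification that makes the algorithm computationally inefficient.
Improved on the algorithm of Singh et al. (2020) for ergodic MDPs, whose regret and constraint violation are both $\tilde{O}(T^{2/3})$.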
Maintained constant constraint violation for ergodic MDPs.
Introduced what the authors believe to be the first provable algorithms for weakly communicating MDPs with cost constraints.

Abstract

We study regret minimization for infinite-horizon average-reward Markov Decision Processes (MDPs) under cost constraints. We start by designing a policy optimization algorithm with a carefully designed action-value estimator and bonus term, and show that for ergodic MDPs, our algorithm ensures $\tilde{O}(\sqrt{T})$ regret and constant constraint violation, where $T$ is the total number of time steps. This strictly improves over the algorithm of Singh et al. (2020), whose regret and constraint violation are both $\tilde{O}(T^{2/3})$. Next, we consider the most general class of weakly communicating MDPs. Through a finite-horizon approximation, we develop another algorithm with $\tilde{O}(T^{2/3})$ regret and constraint violation, which can be further improved to $\tilde{O}(\sqrt{T})$ via a simple modification, albeit making the algorithm computationally inefficient. As far as we know, these are the first set of provable algorithms for weakly communicating MDPs with cost constraints.
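To make the two quantities in the abstract concrete, the sketch below computes them for a single observed trajectory. It assumes the usual definitions for constrained average-reward problems, which may differ in notation from the paper: regret is $T$ times the optimal average reward minus the reward actually collected, and constraint violation is the total cost incurred minus $T$ times a per-step cost threshold. The function name, its parameters, and the toy numbers are hypothetical and are not taken from the paper.

```python
from typing import Sequence, Tuple


def regret_and_violation(
    rewards: Sequence[float],
    costs: Sequence[float],
    optimal_gain: float,
    cost_threshold: float,
) -> Tuple[float, float]:
    """Return (regret, constraint violation) accumulated over one trajectory.

    optimal_gain is the average reward of the best constraint-satisfying
    policy and cost_threshold is the per-step cost budget. Both are assumed
    to be known here purely for illustration.
    """
    if len(rewards) != len(costs):
        raise ValueError("rewards and costs must cover the same time steps")
    horizon = len(rewards)
    # Regret: shortfall against earning the optimal average reward every step.
    regret = horizon * optimal_gain - sum(rewards)
    # Violation: total cost in excess of the per-step budget.
    violation = sum(costs) - horizon * cost_threshold
    return regret, violation


# Toy trajectory with T = 4 steps (made-up numbers, not from the paper).
regret, violation = regret_and_violation(
    rewards=[1.0, 0.0, 1.0, 1.0],
    costs=[0.5, 0.0, 0.5, 0.5],
    optimal_gain=1.0,
    cost_threshold=0.25,
)
print(regret, violation)
```

Under these assumed definitions, a bound of $\tilde{O}(\sqrt{T})$ on regret means the per-step shortfall vanishes as $T$ grows, and a constant constraint violation means the total excess cost stays bounded no matter how long the learner runs.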
