This paper investigates infinite-horizon average reward Constrained Markov Decision Processes (CMDPs) with general parametrization. We propose a Primal-Dual Natural Actor-Critic algorithm that adeptly manages constraints while ensuring a high convergence rate. In particular, our algorithm achieves global convergence and constraint violation rates of $\tilde{\mathcal{O}}(1/\sqrt{T})$ over a horizon of length $T$ when the mixing time, $\tau_{\mathrm{mix}}$, is known to the learner. In the absence of knowledge of $\tau_{\mathrm{mix}}$, the achievable rates change to $\tilde{\mathcal{O}}(1/T^{0.5-\epsilon})$, provided that $T \geq \tilde{\mathcal{O}}(\tau_{\mathrm{mix}}^{2/\epsilon})$. Our results match the theoretical lower bound for Markov Decision Processes and establish a new benchmark in the theoretical exploration of average reward CMDPs.
Xu et al. studied this question.