What does this research mean for the field?

The error introduced by entropy regularization in discounted Markov decision processes decreases exponentially in the inverse regularization strength, improving upon previously known estimates. Novelty: novel finding. Consensus alignment: challenges consensus.

What question did this study set out to answer?

To analyze how entropy regularization affects convergence rates in Markov decision processes.

February 21, 2026

Optimal rates of convergence for entropy regularization in discounted Markov decision processes

Key Points

Studied the error introduced by entropy regularization in infinite-horizon discrete discounted Markov decision processes.
Characterized the convergence rate both in a weighted Kullback–Leibler divergence and in value.
Showed that solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward.
Extended the analysis to general convex potentials and their resulting generalized natural policy gradients.
Showed that the error decreases exponentially in the inverse regularization strength.
Provided a lower bound that matches the upper bound up to a polynomial term.
Established that for entropy-regularized natural policy gradient methods the overall error decays exponentially in the square root of the number of iterations.
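
For orientation, the quantities behind these points can be written out as follows. This is a sketch in standard notation, not the paper's own: the symbols \tau (regularization strength), \gamma (discount factor), H (Shannon entropy), \pi^*_\tau (the regularized optimal policy) and the unnormalized value convention are assumptions, and \Delta is only a placeholder for the problem-specific exponent, whose exact form is not given here.

```latex
% Entropy-regularized value of a policy \pi (assumed, unnormalized convention)
V_\tau^\pi(s) = \mathbb{E}_\pi\!\left[ \sum_{t \ge 0} \gamma^t
  \bigl( r(s_t, a_t) + \tau\, H(\pi(\cdot \mid s_t)) \bigr) \,\middle|\, s_0 = s \right]

% Classical linear estimate for the error of the regularized optimal policy
0 \le V^*(s) - V^{\pi^*_\tau}(s) \le \frac{\tau \log |\mathcal{A}|}{1 - \gamma}

% Schematic shape of an exponential rate (\Delta > 0 is a placeholder)
V^*(s) - V^{\pi^*_\tau}(s) \lesssim e^{-\Delta / \tau}
```

The second line follows from the entropy being bounded between 0 and \log|\mathcal{A}|; the third line only illustrates what "exponential in the inverse regularization strength" means and omits the polynomial term mentioned above.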

Abstract

We study the error introduced by entropy regularization in infinite-horizon discrete discounted Markov decision processes. We show that this error decreases exponentially in the inverse regularization strength, both in a weighted Kullback–Leibler divergence and in value with a problem-specific exponent. This is in contrast to previously known estimates, of the order $O(\tau)$, where $\tau$ is the regularization strength. We provide a lower bound that matches our upper bound up to a polynomial term, thereby characterizing the exponential convergence rate for entropy regularization. Our proof relies on the observation that the solutions of entropy-regularized Markov decision processes solve a gradient flow of the unregularized reward with respect to a Riemannian metric common in natural policy gradient methods. This correspondence allows us to identify the limit of this gradient flow as the generalized maximum entropy optimal policy, thereby characterizing the implicit bias of this gradient flow, which corresponds to a time-continuous version of the natural policy gradient method. We use our improved error estimates to show that for entropy-regularized natural policy gradient methods, the overall error decays exponentially in the square root of the number of iterations, improving over existing sublinear guarantees. Finally, we extend our analysis to settings beyond the entropy. In particular, we characterize the implicit bias regarding general convex potentials and their resulting generalized natural policy gradients.
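
The regularization error discussed in the abstract can be observed numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the study's code: it builds a small random MDP (the sizes, seed, discount factor and all function names are hypothetical), computes the entropy-regularized optimal policy by soft value iteration for several regularization strengths, and reports how far that policy falls short of the unregularized optimum, next to the classical linear bound `tau * log(|A|) / (1 - gamma)`.

```python
import numpy as np


def random_mdp(n_states=5, n_actions=3, seed=0):
    """Hypothetical toy MDP: random transitions P[s, a, s'] and rewards r[s, a] in [0, 1)."""
    rng = np.random.default_rng(seed)
    P = rng.random((n_states, n_actions, n_states))
    P /= P.sum(axis=2, keepdims=True)  # normalize each row into a distribution
    r = rng.random((n_states, n_actions))
    return P, r


def optimal_value(P, r, gamma, iters=5000, tol=1e-12):
    """Unregularized optimal value function via standard value iteration."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        V_new = (r + gamma * P @ V).max(axis=1)
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    return V


def soft_optimal_policy(P, r, gamma, tau, iters=5000, tol=1e-12):
    """Entropy-regularized optimal policy via soft value iteration."""
    V = np.zeros(r.shape[0])
    for _ in range(iters):
        Q = r + gamma * P @ V
        m = Q.max(axis=1)
        # Soft Bellman backup V = tau * logsumexp(Q / tau), shifted for stability.
        V_new = m + tau * np.log(np.exp((Q - m[:, None]) / tau).sum(axis=1))
        converged = np.max(np.abs(V_new - V)) < tol
        V = V_new
        if converged:
            break
    Q = r + gamma * P @ V
    pi = np.exp((Q - Q.max(axis=1, keepdims=True)) / tau)
    return pi / pi.sum(axis=1, keepdims=True)  # softmax of Q / tau


def policy_value(P, r, gamma, pi):
    """Unregularized value of policy pi, from the linear Bellman equation."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = (pi * r).sum(axis=1)
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def regularization_gaps(taus, gamma=0.9, seed=0):
    """Worst-case (over states) value gap of the regularized optimal policy."""
    P, r = random_mdp(seed=seed)
    V_star = optimal_value(P, r, gamma)
    gaps = []
    for tau in taus:
        pi_tau = soft_optimal_policy(P, r, gamma, tau)
        gaps.append(float(np.max(V_star - policy_value(P, r, gamma, pi_tau))))
    return gaps


if __name__ == "__main__":
    taus = [1.0, 0.5, 0.25, 0.125]
    for tau, gap in zip(taus, regularization_gaps(taus)):
        linear_bound = tau * np.log(3) / (1 - 0.9)
        print(f"tau={tau:<6} gap={gap:.3e}  linear bound={linear_bound:.3e}")
```

Plotting the logarithm of the gap against `1 / tau` is one way to eyeball whether the decay looks exponential on a given instance; a single random MDP is only suggestive and does not reproduce the paper's bounds or exponent.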
