What type of study is this?

This is a quantitative study.

October 5, 2025 | Open Access

Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

Key Points

Sample complexity bounds indicate exponential dependence on effective horizon and risk-sensitivity parameters.
The model-based approach generates PAC-bounds on the optimal state-action value function and policy.
Lower bounds establish that exponential dependence in risk-sensitivity and effective horizon is unavoidable.
PAC-bounds are tight in parameters, and polynomial dependence in all model parameters is not achievable.

Abstract

In this paper, we analyze the sample complexities of learning the optimal state-action value function Q^* and an optimal policy π^* in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk-preferences with risk-parameter β ≠ 0 and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive Q-value-iteration (MB-RS-QVI), which leads to (ε, δ)-PAC-bounds on \|Q^* - Q_k\| and \|V^* - V^{π_k}\|, where Q_k is the output of MB-RS-QVI after k iterations and π_k is the greedy policy with respect to Q_k. Both PAC-bounds have exponential dependence on the effective horizon 1/(1-γ), and the strength of this dependence grows with the learner's risk-sensitivity |β|. We also provide two lower bounds which show that exponential dependence on |β|/(1-γ) is unavoidable in both cases. The lower bounds reveal that the PAC-bounds are tight in the parameters S, A, δ, and ε, and that, unlike in the classical setting, it is not possible to have polynomial dependence in all model parameters.
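To make the model-based idea concrete, here is a minimal sketch of how such a procedure could look. It is not taken from the paper. It assumes a reward-maximizing agent, known deterministic rewards, and the commonly used recursive entropic Bellman update Q(s,a) = r(s,a) + γ (1/β) log E[exp(β V(s'))] with V(s) = max_a Q(s,a). All function and parameter names (`entropic_risk`, `estimate_model`, `rs_q_value_iteration`, `greedy_policy`, `sample_next_state`) are hypothetical.

```python
import math


def entropic_risk(values, probs, beta):
    """Return (1/beta) * log E[exp(beta * X)] for a finite distribution.

    Uses a log-sum-exp shift over the support to avoid overflow.
    Assumes beta != 0 and that probs has at least one positive entry.
    """
    support = [(v, p) for v, p in zip(values, probs) if p > 0]
    shift = max(beta * v for v, _ in support)
    total = sum(p * math.exp(beta * v - shift) for v, p in support)
    return (shift + math.log(total)) / beta


def estimate_model(sample_next_state, n_states, n_actions, n_samples):
    """Build an empirical transition kernel from a generative model.

    sample_next_state(s, a) is an assumed callable returning a next-state index.
    """
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def rs_q_value_iteration(p_hat, reward, gamma, beta, n_iters):
    """Run n_iters risk-sensitive Q-value-iteration sweeps on the empirical model."""
    n_states, n_actions = len(p_hat), len(p_hat[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [max(row) for row in q]  # V(s) = max_a Q(s, a)
        q = [
            [reward[s][a] + gamma * entropic_risk(v, p_hat[s][a], beta)
             for a in range(n_actions)]
            for s in range(n_states)
        ]
    return q


def greedy_policy(q):
    """Pick, for each state, the first action attaining the maximal Q-value."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]
```

Under these assumptions, one would call `estimate_model` with a sampler and a per-pair sample budget, pass the result to `rs_q_value_iteration`, and read off the policy with `greedy_policy`. A negative `beta` makes the update risk-averse (the entropic value falls below the mean), while a positive `beta` makes it risk-seeking.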
