In this paper, we analyze the sample complexities of learning the optimal state-action value function Q^* and an optimal policy π^* in a finite discounted Markov decision process (MDP) where the agent has recursive entropic risk preferences with risk parameter β ≠ 0 and where a generative model of the MDP is available. We provide and analyze a simple model-based approach, which we call model-based risk-sensitive Q-value iteration (MB-RS-QVI), that leads to (ε, δ)-PAC bounds on ‖Q^* − Qₖ‖ and ‖V^* − V^{πₖ}‖, where Qₖ is the output of MB-RS-QVI after k iterations and πₖ is the greedy policy with respect to Qₖ. Both PAC bounds depend exponentially on the effective horizon 1/(1−γ), and the strength of this dependence grows with the learner's risk sensitivity |β|. We also provide two lower bounds which show that exponential dependence on |β|/(1−γ) is unavoidable in both cases. The lower bounds reveal that the PAC bounds are tight in the parameters S, A, δ, ε, and that, unlike in the classical setting, it is not possible to have polynomial dependence on all model parameters.
Mortensen et al. (Fri,) studied this question.
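As a rough illustration of the kind of procedure the abstract describes, the following is a minimal sketch of model-based Q-value iteration under an entropic risk measure. It assumes the commonly used recursive update Q(s, a) ← r(s, a) + γ · (1/β) · log Σ_{s'} P̂(s' | s, a) · exp(β · max_{a'} Q(s', a')), where P̂ is estimated from generative-model samples. The function names (`entropic_risk`, `estimate_model`, `risk_sensitive_qvi`), the toy MDP, and the parameter values are hypothetical and are not taken from the paper; the authors' MB-RS-QVI may differ in its details.

```python
import math


def entropic_risk(probs, values, beta):
    """(1/beta) * log(sum_i p_i * exp(beta * v_i)) for beta != 0, computed stably."""
    support = [(p, beta * v) for p, v in zip(probs, values) if p > 0.0]
    shift = max(x for _, x in support)  # log-sum-exp shift to avoid overflow
    total = sum(p * math.exp(x - shift) for p, x in support)
    return (shift + math.log(total)) / beta


def estimate_model(sample_next_state, n_states, n_actions, n_samples):
    """Empirical transition probabilities from n_samples generative-model calls per (s, a)."""
    p_hat = []
    for s in range(n_states):
        row = []
        for a in range(n_actions):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(s, a)] += 1
            row.append([c / n_samples for c in counts])
        p_hat.append(row)
    return p_hat


def risk_sensitive_qvi(p_hat, reward, gamma, beta, n_iters):
    """Iterate the assumed entropic Bellman update on the empirical model."""
    n_states, n_actions = len(p_hat), len(p_hat[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(n_iters):
        v = [max(row) for row in q]  # greedy state values from the current Q
        q = [[reward[s][a] + gamma * entropic_risk(p_hat[s][a], v, beta)
              for a in range(n_actions)] for s in range(n_states)]
    # Greedy policy with respect to the final Q estimate.
    policy = [max(range(n_actions), key=lambda a: q[s][a]) for s in range(n_states)]
    return q, policy


# Hypothetical toy MDP: action a moves deterministically to state a,
# and action 1 pays reward 1 while action 0 pays nothing.
def toy_sampler(s, a):
    return a


toy_reward = [[0.0, 1.0], [0.0, 1.0]]
p_hat = estimate_model(toy_sampler, n_states=2, n_actions=2, n_samples=10)
q, policy = risk_sensitive_qvi(p_hat, toy_reward, gamma=0.5, beta=-1.0, n_iters=50)
```

In this deterministic toy example the entropic risk of a point mass equals its value, so the iteration reduces to ordinary value iteration. With stochastic transitions, a negative β pulls the risk-adjusted value below the mean (risk-averse) and a positive β pushes it above (risk-seeking).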