June 19, 2024 · Open Access

Reinforcement Learning for Infinite-Horizon Average-Reward MDPs with Multinomial Logistic Function Approximation

Abstract

We study model-based reinforcement learning with non-linear function approximation where the transition function of the underlying Markov decision process (MDP) is given by a multinomial logistic (MNL) model. In this paper, we develop two algorithms for the infinite-horizon average-reward setting. Our first algorithm, UCRL2-MNL, applies to the class of communicating MDPs and achieves an $\tilde{O}(dD\sqrt{T})$ regret, where $d$ is the dimension of the feature mapping, $D$ is the diameter of the underlying MDP, and $T$ is the horizon. The second algorithm, OVIFH-MNL, is computationally more efficient and applies to the more general class of weakly communicating MDPs, for which we show a regret guarantee of $\tilde{O}(d^{2/5}\,\mathrm{sp}(v^*)\,T^{4/5})$, where $\mathrm{sp}(v^*)$ is the span of the associated optimal bias function. We also prove a lower bound of $\Omega(d\sqrt{DT})$ for learning communicating MDPs with MNL transitions of diameter at most $D$. Furthermore, we show a regret lower bound of $\Omega(dH^{3/2}\sqrt{K})$ for learning $H$-horizon episodic MDPs with MNL function approximation, where $K$ is the number of episodes, which improves upon the best-known lower bound for the finite-horizon setting.
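To make the MNL transition model concrete, here is a minimal sketch in Python. It assumes the form commonly used in the MNL-MDP literature, where the probability of moving to a next state is a softmax over inner products of a feature vector with an unknown parameter vector. The abstract does not spell out this formula, so the function name `mnl_transition_probs`, the feature vectors, and the parameter `theta` below are all hypothetical and chosen only for illustration.

```python
import math

def mnl_transition_probs(features, theta):
    """Softmax (multinomial logistic) transition probabilities.

    features: dict mapping each candidate next state to its feature
              vector for a fixed (state, action) pair (hypothetical layout).
    theta:    parameter vector of the same dimension d (assumed known here;
              in the learning problem it would have to be estimated).
    """
    # Linear score for each candidate next state.
    logits = {
        s_next: sum(f * t for f, t in zip(phi, theta))
        for s_next, phi in features.items()
    }
    # Subtract the maximum logit for numerical stability.
    m = max(logits.values())
    weights = {s_next: math.exp(z - m) for s_next, z in logits.items()}
    total = sum(weights.values())
    return {s_next: w / total for s_next, w in weights.items()}

# Hypothetical example with d = 2 and three reachable next states.
features = {"s1": [1.0, 0.0], "s2": [0.0, 1.0], "s3": [0.0, 0.0]}
theta = [0.5, -0.5]
probs = mnl_transition_probs(features, theta)
```

Under this assumed form the probabilities always sum to one, and a next state with a larger score `phi · theta` receives a larger probability.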

Cite This Study

Park et al. (2024) studied this question.

synapsesocial.com/papers/68e642a2b6db6435875d4698 https://doi.org/10.48550/arxiv.2406.13633