Los puntos clave no están disponibles para este artículo en este momento.
Algorithms for learning the optimal policy of a Markov decision process (MDP) based on simulated transitions are formulated and analyzed. These are variants of the well-known "actor-critic" (or "adaptive critic") algorithm in the artificial intelligence literature. Distributed asynchronous implementations are considered. The analysis involves two time scale stochastic approximations.
Konda et al. (Fri,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: