Temporal difference (TD) learning algorithms with neural network function parameterization have well-established empirical success in many practical large-scale reinforcement learning tasks. However, theoretical understanding of these algorithms remains challenging due to the nonlinearity of the action-value approximation. In this paper, we develop an improved non-asymptotic analysis of the neural TD method with a general $L$-layer neural network. New proof techniques are developed and an improved $O(\epsilon^{-1})$ sample complexity is derived. To the best of our knowledge, this is the first finite-time analysis of neural TD that achieves an $O(\epsilon^{-1})$ complexity under Markovian sampling, as opposed to the best known $O(\epsilon^{-2})$ complexity in the existing literature.
Ke et al. studied this question.
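For readers unfamiliar with the setting, the following is a minimal sketch of semi-gradient TD(0) with a neural network trained on a single Markovian trajectory. It is not the algorithm or analysis from the paper. It assumes a two-layer ReLU network rather than a general $L$-layer one, omits any projection step, and uses a made-up three-state chain with one-hot features standing in for state-action pairs under a fixed policy. All names (`TwoLayerReLUNet`, `run_neural_td`, the transition matrix `P`, and so on) are hypothetical.

```python
import math
import random

# Hypothetical 3-state Markov chain induced by a fixed policy (an assumption).
P = [[0.5, 0.5, 0.0], [0.25, 0.5, 0.25], [0.0, 0.5, 0.5]]
REWARD = [1.0, 2.0, 3.0]
FEATURES = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot
GAMMA = 0.9


class TwoLayerReLUNet:
    """q(x) = (1/sqrt(m)) * sum_r b_r * relu(w_r . x); only W is trained."""

    def __init__(self, dim, width, rng):
        self.width = width
        self.W = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
        self.b = [rng.choice((-1.0, 1.0)) for _ in range(width)]  # fixed signs

    def forward(self, x):
        pre = [sum(wi * xi for wi, xi in zip(w, x)) for w in self.W]
        out = sum(b * max(z, 0.0) for b, z in zip(self.b, pre))
        return out / math.sqrt(self.width), pre

    def td_step(self, x, reward, x_next, gamma, lr):
        q, pre = self.forward(x)
        q_next, _ = self.forward(x_next)
        # TD error; the bootstrap target is treated as a constant (semi-gradient).
        delta = reward + gamma * q_next - q
        scale = lr * delta / math.sqrt(self.width)
        for r, z in enumerate(pre):
            if z > 0.0:  # ReLU passes gradient only through active units
                for i, xi in enumerate(x):
                    self.W[r][i] += scale * self.b[r] * xi
        return delta


def true_values(sweeps=2000):
    """Reference values from fixed-point iteration of v = r + gamma * P v."""
    v = [0.0, 0.0, 0.0]
    for _ in range(sweeps):
        v = [REWARD[s] + GAMMA * sum(P[s][t] * v[t] for t in range(3))
             for s in range(3)]
    return v


def max_error(net, target):
    return max(abs(net.forward(FEATURES[s])[0] - target[s]) for s in range(3))


def run_neural_td(steps=20000, lr=0.2, width=64, seed=0):
    rng = random.Random(seed)
    net = TwoLayerReLUNet(3, width, rng)
    target = true_values()
    err_before = max_error(net, target)
    state = 0
    for _ in range(steps):
        # Markovian sampling: each sample depends on the previous state.
        next_state = rng.choices(range(3), weights=P[state])[0]
        net.td_step(FEATURES[state], REWARD[state], FEATURES[next_state], GAMMA, lr)
        state = next_state
    return err_before, max_error(net, target)


if __name__ == "__main__":
    before, after = run_neural_td()
    print("max value error before:", before, "after:", after)
```

The samples here come from one continuous trajectory rather than independent draws, which is the Markovian sampling regime the abstract refers to. The constant step size and fixed output-layer signs are simplifications chosen to keep the sketch short.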