In value-based deep reinforcement learning (RL), errors in value function approximation lead to suboptimal policies. Temporal difference (TD) learning is one of the most important methods for approximating the state-action (Q) value function. In TD learning, it is critical to estimate the Q values of greedy actions accurately, because the target Q value is built from them and a more accurate target improves the accuracy of the Q value estimate. To improve this accuracy, we propose an action-ranked TD learning method that enhances the performance of deep RL by weighting each TD error according to the rank of its state-action pair's value among all the Q values in that state. The proposed method provides more accurate target values for TD learning, which in turn makes the estimation of the Q value more accurate. We apply the proposed method to a representative value-based deep RL algorithm, and the results show that it outperforms the baselines on 31 out of 40 Atari games. Furthermore, we extend the proposed method to multi-agent deep RL. To determine the hyperparameter of action-ranked TD learning adaptively, we propose a meta action-ranked TD learning method. A series of experiments quantitatively verifies that our methods outperform the baselines on Atari games, StarCraft-II, and Grid World environments.
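The rank-based weighting described in the abstract can be illustrated with a minimal sketch. The abstract does not give the weighting function, so everything specific here is an assumption made for illustration: the exponential decay in `rank_weight`, the hyperparameter name `beta`, the choice to give higher-ranked (more greedy) actions larger weights, and the helper names themselves. The one-step Q-learning target is the standard one, not necessarily the target used by the authors.

```python
import math


def action_ranks(q_row):
    """Rank every action in one state: 0 for the highest Q value, 1 for the next, and so on."""
    # sorted() is stable, so tied Q values keep their original index order.
    order = sorted(range(len(q_row)), key=lambda a: -q_row[a])
    ranks = [0] * len(q_row)
    for rank, action in enumerate(order):
        ranks[action] = rank
    return ranks


def rank_weight(rank, beta=0.5):
    """Hypothetical weight: exponential decay in rank (an assumption, not from the paper)."""
    return math.exp(-beta * rank)


def action_ranked_td_loss(batch, gamma=0.99, beta=0.5):
    """Mean rank-weighted squared TD error.

    batch: list of (q_row, action, reward, next_q_row, done) tuples, where
    q_row and next_q_row hold the Q values of all actions in the current
    and next state.
    """
    total = 0.0
    for q_row, action, reward, next_q_row, done in batch:
        # Standard one-step Q-learning target; no bootstrapping on terminal states.
        target = reward + (0.0 if done else gamma * max(next_q_row))
        td_error = target - q_row[action]
        # Weight the error by the rank of the taken action among all Q values in this state.
        weight = rank_weight(action_ranks(q_row)[action], beta)
        total += weight * td_error ** 2
    return total / len(batch)
```

With `beta=0` every weight is 1 and the loss reduces to the ordinary mean squared TD error, which makes the assumed weighting easy to compare against an unweighted baseline.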
Liu et al. (Mon,) studied this question.