We present an off-policy actor-critic algorithm for Reinforcement Learning (RL) that combines ideas from gradient-free optimization via stochastic search with a learned action-value function. The result is a simple procedure consisting of three steps: i) policy evaluation by estimating a parametric action-value function; ii) policy improvement via the estimation of a local non-parametric policy; and iii) generalization by fitting a parametric policy. Each step can be implemented in different ways, giving rise to several algorithm variants. Our algorithm draws on connections to the existing literature on black-box optimization and 'RL as inference', and it can be seen either as an extension of the Maximum a Posteriori Policy Optimisation algorithm (MPO) (Abdolmaleki et al., 2018a), or as an extension of Trust Region Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) (Abdolmaleki et al., 2017b; Hansen et al., 1997) to a policy iteration scheme. Our comparison on 31 continuous control tasks with diverse properties, drawn from the parkour suite (Heess et al., 2017), the DeepMind control suite (Tassa et al., 2018), and OpenAI Gym (Brockman et al., 2016), uses a limited amount of compute and a single set of hyperparameters, and demonstrates the effectiveness of our method and its state-of-the-art results. Videos summarizing the results can be found at goo.gl/HtvJKR .
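Steps ii) and iii) can be illustrated with a minimal sketch for a single state and a one-dimensional Gaussian policy. Everything specific here is an assumption made for illustration and is not taken from the abstract: `toy_q` is a hypothetical stand-in for the learned action-value function of step i), the softmax weighting with a fixed `temperature` is one common choice in the 'RL as inference' literature, and the trust-region machinery suggested by the MPO and CMA-ES connections is omitted.

```python
import numpy as np


def toy_q(action):
    # Hypothetical stand-in for a learned Q(s, a) at one fixed state.
    # It peaks at action = 2.0; a real critic would be fit from data (step i).
    return -(action - 2.0) ** 2


def improve_and_fit(mean, std, rng, num_samples=64, temperature=1.0):
    # Step ii (assumed form): sample actions from the current policy and
    # build a local non-parametric policy by reweighting the samples
    # with a softmax over their Q-values.
    actions = rng.normal(mean, std, size=num_samples)
    q_values = toy_q(actions)
    logits = (q_values - q_values.max()) / temperature  # shift for stability
    weights = np.exp(logits)
    weights /= weights.sum()

    # Step iii (assumed form): fit the parametric Gaussian policy to the
    # weighted samples by weighted maximum likelihood.
    new_mean = np.sum(weights * actions)
    new_var = np.sum(weights * (actions - new_mean) ** 2)
    new_std = np.sqrt(new_var) + 1e-3  # small floor keeps sampling possible
    return new_mean, new_std


rng = np.random.default_rng(0)
mean, std = 0.0, 1.0
for _ in range(20):
    mean, std = improve_and_fit(mean, std, rng)

print(f"policy mean: {mean:.3f}, policy std: {std:.3f}")
```

In this toy setting the policy mean drifts toward the maximizer of `toy_q` while the standard deviation shrinks. Without a constraint on how far each update may move the policy, the variance can collapse early, which is one reason a practical variant would likely bound each step.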
Abdolmaleki et al. studied this question.