In this work we present a new agent architecture, called Reactor, which combines multiple algorithmic and architectural contributions to produce an agent with higher sample-efficiency than Prioritized Dueling DQN (Wang et al., 2016) and Categorical DQN (Bellemare et al., 2017), while giving better run-time performance than A3C (Mnih et al., 2016). Our first contribution is a new policy evaluation algorithm called Distributional Retrace, which brings multi-step off-policy updates to the distributional reinforcement learning setting. The same approach can be used to convert several classes of multi-step policy evaluation algorithms designed for expected value evaluation into distributional ones. Next, we introduce the β-leave-one-out policy gradient algorithm, which improves the trade-off between variance and bias by using action values as a baseline. Our final algorithmic contribution is a new prioritized replay algorithm for sequences, which exploits the temporal locality of neighboring observations for more efficient replay prioritization. Using the Atari 2600 benchmarks, we show that each of these innovations contributes to both the sample efficiency and final agent performance. Finally, we demonstrate that Reactor reaches state-of-the-art performance after 200 million frames and less than a day of training.
Gruslys et al. studied this question.
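To make the phrase "multi-step off-policy updates" concrete, the following is a minimal sketch of the expected-value Retrace(λ) target that Distributional Retrace generalizes. It is not the distributional algorithm from the paper, and it is not taken from the authors' code. The function name `retrace_targets` and its argument layout are assumptions made for illustration. The sketch assumes a single trajectory and truncated importance weights c_t = λ · min(1, π(a_t|x_t) / μ(a_t|x_t)).

```python
def retrace_targets(q_taken, v_next, rewards, ratios, gamma=0.99, lam=1.0):
    """Expected-value Retrace(lambda) targets for one trajectory of length T.

    All arguments are hypothetical names chosen for this sketch:
      q_taken[t] : current estimate Q(x_t, a_t)
      v_next[t]  : expectation of Q(x_{t+1}, .) under the target policy
                   (use 0.0 where x_{t+1} is terminal)
      rewards[t] : reward r_t
      ratios[t]  : pi(a_t | x_t) / mu(a_t | x_t)
    """
    T = len(rewards)
    # Truncated importance weights; capping at 1 keeps the variance bounded.
    c = [lam * min(1.0, rho) for rho in ratios]
    targets = [0.0] * T
    correction = 0.0  # correction accumulated from later time steps
    for t in reversed(range(T)):
        # One-step TD error under the target policy.
        delta = rewards[t] + gamma * v_next[t] - q_taken[t]
        # Later corrections are discounted and cut by the trace c_{t+1}.
        trace = c[t + 1] if t + 1 < T else 0.0
        correction = delta + gamma * trace * correction
        targets[t] = q_taken[t] + correction
    return targets


# On-policy data (all ratios equal 1) with zero value estimates:
# the targets reduce to discounted returns.
on_policy = retrace_targets([0.0] * 3, [0.0] * 3, [1.0] * 3, [1.0] * 3, gamma=0.5)

# A zero ratio at step 1 cuts the trace, so step 0 falls back to a one-step target.
cut = retrace_targets([0.0] * 3, [0.0] * 3, [1.0] * 3, [1.0, 0.0, 1.0], gamma=0.5)
```

When the behaviour and target policies agree, the traces stay at λ and the update uses the full multi-step return. When they disagree strongly, the trace shrinks toward zero and the update degrades gracefully toward a one-step target.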