Deep Reinforcement Learning (DRL) algorithms for continuous action spaces are known to be brittle toward hyperparameters as well as sample inefficient. Soft Actor Critic (SAC) proposes an off-policy deep actor critic algorithm within the maximum entropy RL framework, which offers greater stability and empirical gains. The choice of policy distribution, a factored Gaussian, is motivated by its easy re-parametrization rather than its modeling power. We introduce Normalizing Flow policies within the SAC framework that learn more expressive classes of policies than simple factored Gaussians. We also present a series of stabilization tricks that enable training of these policies in the RL setting. We show empirically on grid world tasks that our approach increases stability and is better suited to difficult exploration in sparse reward settings.
Ward et al. studied this question.
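To make the idea of a normalizing flow policy more concrete, the sketch below samples an action from a reparameterized factored Gaussian and then passes it through a stack of invertible flow layers, correcting the log-probability with each layer's log-determinant. This is only an illustration under stated assumptions: the abstract does not say which flow family is used, so a planar flow is swapped in here for simplicity, and the names `gaussian_sample`, `planar_flow`, and `sample_action` are hypothetical. The stabilization tricks mentioned in the abstract and any action squashing are omitted.

```python
import numpy as np


def gaussian_sample(mean, log_std, rng):
    """Reparameterized sample from a factored Gaussian, with its log-density."""
    eps = rng.standard_normal(mean.shape)
    z = mean + np.exp(log_std) * eps
    log_prob = np.sum(
        -0.5 * eps**2 - log_std - 0.5 * np.log(2.0 * np.pi), axis=-1
    )
    return z, log_prob


def planar_flow(z, w, u, b):
    """One planar flow layer f(z) = z + u_hat * tanh(w.z + b).

    Assumption: a planar flow is used purely for illustration.
    u is reparameterized to u_hat so that w.u_hat > -1, which keeps
    the layer invertible.
    """
    wu = w @ u
    # softplus(wu) - 1 > -1 replaces the raw inner product w.u
    u_hat = u + (np.logaddexp(0.0, wu) - 1.0 - wu) * w / (w @ w)
    h = np.tanh(z @ w + b)                      # shape (batch,)
    out = z + h[:, None] * u_hat                # shape (batch, dim)
    # log |det df/dz| = log |1 + (1 - tanh^2) * w.u_hat|
    log_det = np.log(np.abs(1.0 + (1.0 - h**2) * (w @ u_hat)))
    return out, log_det


def sample_action(mean, log_std, flow_params, rng):
    """Sample from the base Gaussian, then apply each flow layer in turn."""
    z, log_prob = gaussian_sample(mean, log_std, rng)
    for w, u, b in flow_params:
        z, log_det = planar_flow(z, w, u, b)
        log_prob = log_prob - log_det           # change of variables
    return z, log_prob


rng = np.random.default_rng(0)
mean = np.zeros((4, 2))                         # batch of 4 states, 2-D actions
log_std = np.full((4, 2), -0.5)
flows = [(np.array([1.0, -0.5]), np.array([0.3, 0.8]), 0.1)]
actions, log_probs = sample_action(mean, log_std, flows, rng)
```

In a maximum entropy objective such as the one SAC optimizes, the returned `log_probs` would play the role of the policy log-density in the entropy term. With an empty `flows` list the sketch reduces to the plain factored Gaussian policy.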