Abstract Maximum entropy deep reinforcement learning has shown great promise in tackling various challenging continuous tasks. By incorporating the maximum entropy framework, the goal is to introduce more randomness in action selection and improve the training process. However, there exists a tradeoff between efficiency and stability, especially when dealing with large-scale tasks with high state and action dimensions. In certain situations, it becomes necessary to constrain the temperature hyperparameter of the maximum entropy term to prevent instability, which can hinder convergence. In this study, we propose an algorithm that combines adaptive and asymptotic maximum entropy with actor-critic random policies. Specifically, we introduce a state-dependent adaptive temperature to accelerate the training process and include an additional term involving asymptotic maximum entropy to ensure stable convergence. These components are combined with the selected critic value to serve as the target Q-value and the surrogate objective in the policy evaluation and improvement steps. The adaptive and asymptotic maximum entropy algorithm demonstrates robust adaptation to the efficiency-stability tradeoff, providing increased exploration and flexibility to address saddle point problems. We evaluate our method on various Gym tasks, and the results indicate that our proposed algorithms outperform several baselines in the domain of continuous control.
Zhang et al. studied this question.
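The abstract describes a target Q-value built from a selected critic value plus entropy terms weighted by a state-dependent temperature. A minimal sketch of that general idea, in the style of a soft actor-critic target, might look as follows. Everything specific here is an assumption rather than the authors' formulation: the function name `soft_target_q`, taking the "selected" critic value to be the elementwise minimum of two critics, supplying the per-state temperature `alpha_next` as a precomputed array, and the hypothetical coefficient `beta`, which only stands in for the additional entropy term whose exact form the abstract does not give.

```python
import numpy as np


def soft_target_q(reward, done, q1_next, q2_next, logp_next, alpha_next,
                  gamma=0.99, beta=0.0):
    """Entropy-regularized TD target for a batch of transitions.

    reward, done : arrays of shape (batch,), done is 1.0 at terminal steps
    q1_next, q2_next : critic estimates Q(s', a') for a' sampled from the policy
    logp_next : log pi(a' | s') for the sampled next actions
    alpha_next : per-state temperature alpha(s') (assumed to be given)
    beta : hypothetical fixed coefficient standing in for an extra entropy term
    """
    # Assumption: the "selected" critic value is the elementwise minimum
    # of two critics (clipped double-Q), as in common actor-critic methods.
    q_selected = np.minimum(q1_next, q2_next)

    # Entropy bonus: -log pi is a single-sample estimate of policy entropy,
    # weighted here by the state-dependent temperature plus the placeholder.
    entropy_bonus = -(alpha_next + beta) * logp_next

    # Bootstrap only from non-terminal next states.
    return reward + gamma * (1.0 - done) * (q_selected + entropy_bonus)
```

In this sketch a larger `alpha_next` for a given state raises the value assigned to uncertain (high-entropy) behavior in that state, which is one way a state-dependent temperature could encourage more exploration where it is needed.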