In the field of Deep reinforcement learning (DRL), enhancing exploration capabilities and improving the accuracy of Q-value estimation remain two major challenges. Recently, double-actor DRL methods have emerged as a promising class of DRL approaches, achieving substantial advancements in both exploration and Q-value estimation. However, existing double-actor DRL methods feature actors that operate independently in exploring the environment, lacking mutual learning and collaboration, which leads to suboptimal policies. To address this challenge, this work proposes a generic solution that can be seamlessly integrated into existing double-actor DRL methods by promoting mutual learning among the actors to develop improved policies. Specifically, we calculate the difference in actions output by the actors and minimize this difference as a loss during training to facilitate mutual imitation among the actors. Simultaneously, we also minimize the differences in Q-values output by the various critics as part of the loss, thereby avoiding significant discrepancies in value estimation for the imitated actions. We present two specific implementations of our method and extend these implementations beyond double-actor DRL methods to other DRL approaches to encourage broader adoption. Experimental results demonstrate that our method significantly improves twenty state-of-the-art (SOTA) DRL methods, including SOTA double-actor DRL methods, across eleven tasks, as measured by return and other metrics.
Xu et al. (Thu,) studied this question.