What question did this study set out to answer?

This work aims to enhance exploration and Q-value estimation in deep reinforcement learning by promoting mutual learning among actors.

February 19, 2026

A Generic Competitive-Cooperative Actor-Critic Framework for Deep Reinforcement Learning

Key Points

Proposed a generic framework for double-actor DRL methods.
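Calculated the difference between the actions output by the actors and minimized it as a loss term, so that the actors imitate each other.
Minimized the differences among the critics' Q-values during training to avoid large discrepancies in value estimates for the imitated actions.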
Presented two implementations of the method and extended them from double-actor DRL methods to other DRL approaches.
Significantly improved twenty state-of-the-art DRL methods across eleven tasks.
Enhanced return and other performance metrics compared to baseline methods.

Abstract

In deep reinforcement learning (DRL), enhancing exploration capabilities and improving the accuracy of Q-value estimation remain two major challenges. Recently, double-actor DRL methods have emerged as a promising class of DRL approaches, achieving substantial advancements in both exploration and Q-value estimation. However, the actors in existing double-actor DRL methods explore the environment independently, without mutual learning or collaboration, which leads to suboptimal policies. To address this challenge, this work proposes a generic solution that can be seamlessly integrated into existing double-actor DRL methods by promoting mutual learning among the actors to develop improved policies. Specifically, we calculate the difference in actions output by the actors and minimize this difference as a loss during training to facilitate mutual imitation among the actors. Simultaneously, we also minimize the differences in Q-values output by the various critics as part of the loss, thereby avoiding significant discrepancies in value estimation for the imitated actions. We present two specific implementations of our method and extend these implementations beyond double-actor DRL methods to other DRL approaches to encourage broader adoption. Experimental results demonstrate that our method significantly improves twenty state-of-the-art (SOTA) DRL methods, including SOTA double-actor DRL methods, across eleven tasks, as measured by return and other metrics.
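
The two penalty terms described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which distance measure is used or how the terms are weighted, so the mean squared difference and the function names `mean_squared_difference` and `mutual_learning_terms` are assumptions made purely for illustration. The sketch takes the actions proposed by two actors for the same state and the Q-values estimated by two critics for the same state-action pairs, and returns the action gap and the Q-value gap that would be added to the training losses.

```python
from typing import Sequence, Tuple


def mean_squared_difference(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of the squared element-wise differences of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    if len(a) == 0:
        raise ValueError("inputs must not be empty")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def mutual_learning_terms(
    action_1: Sequence[float],
    action_2: Sequence[float],
    q_1: Sequence[float],
    q_2: Sequence[float],
) -> Tuple[float, float]:
    """Return (action_gap, q_gap) for two actors and two critics.

    Assumption: both gaps are measured with a mean squared difference.
    The source only states that the differences are minimized as a loss.
    """
    # Gap between the actions the two actors output for the same state.
    action_gap = mean_squared_difference(action_1, action_2)
    # Gap between the Q-values the two critics assign to the same inputs.
    q_gap = mean_squared_difference(q_1, q_2)
    return action_gap, q_gap


if __name__ == "__main__":
    # Hypothetical values: one 2-dimensional action per actor,
    # and Q-values from each critic for a batch of two state-action pairs.
    action_gap, q_gap = mutual_learning_terms(
        action_1=[0.5, -0.2],
        action_2=[0.3, 0.0],
        q_1=[1.0, 2.0],
        q_2=[1.5, 2.0],
    )
    print("action gap:", action_gap)
    print("Q-value gap:", q_gap)
```

In an actual DRL method these terms would be computed on differentiable tensors and added, with some weighting, to the base method's actor and critic losses; the weighting and the exact form of the combination are left open here because the abstract does not specify them.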
