The sparsity of team rewards significantly hinders the learning of optimal team policies in cooperative multiagent reinforcement learning (MARL). While augmenting sparse team rewards with individual rewards is a common solution, existing methods face three critical challenges: 1) inconsistency between the learned policy and the optimal team policy due to reward function modification; 2) incompatibility with different individual reward settings; and 3) a suboptimal balance between individual-reward-oriented and team-reward-oriented policy optimization. To address these challenges, we propose CLOT, a novel policy-consistency-constrained multiagent policy optimization approach that leverages individual rewards in a reward-setting-agnostic manner. Specifically, we first present a constrained policy optimization problem defined by a consistency constraint between the team return of the learned policy and that of the optimal team policy. Then, we develop a Lagrangian dual-based iterative policy optimization procedure to solve the formulated problem, deriving exact optimization objectives for policy training. Within this procedure, a dynamic Lagrangian multiplier update mechanism automatically balances individual-reward-oriented and team-reward-oriented policy optimization. Extensive experimental evaluations across the StarCraft II Multiagent Challenge (SMAC), Multiagent Particle Environment (MPE), and Google Research Football (GRF) environments demonstrate that our approach effectively addresses all three identified challenges, significantly enhancing performance in cooperative multiagent scenarios with sparse team rewards.
Zhang et al. studied this question.