The cooperative multiagent reinforcement learning (MARL) has been widely used in many practical applications. Despite its success, a fundamental issue arises in MARL that agents face the dilemma of whether to select the best action to maximize rewards or to acquire more information collectively by exploring the novel states/actions due to partial observability. To solve this issue, existing methods merge exploration and exploitation methods. However, these methods are always suboptimal and may lead to failure in finishing tasks. In this article, we theoretically prove the existence of a latent state that can guarantee the optimal individual and global policies. Moreover, we prove that such a latent state can be approximately obtained by local observations. Based on the analysis, we propose a method named unified MARL (UMARL), which is a weighted value function factorization approach unifying exploitation and exploration in one framework. Specifically, we design the agent representation network (ARN) and individual weighting networks (IWNs) to learn agents' unified representations and weights of credit. Moreover, a latent state regularizer (LSR) is designed to encourage agents' representations to approximate the latent state. Extensive experiments show that UMARL can achieve superior performance compared with 12 state-of-the-art methods on m -step matrix game, level-based foraging (LBF), StarCraft II, and Google research football (GRF). The source code is available at: https: //github. com/CrazyBayes/UMARL.
Kong et al. (Thu,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: