Cooperative multi-agent reinforcement learning (MARL) has been widely applied to complex decision-making domains because of its strong coordination capabilities. However, existing methods focus primarily on single-task scenarios or, in competitive settings, on opponents with fixed policies, which makes them less effective in non-stationary environments where tasks or opponent policies change dynamically. In this paper, we propose MOFA, a novel algorithm built on the centralized training and decentralized execution (CTDE) framework. By combining a mixture of experts (MoE) with value function decomposition, MOFA achieves fast policy adaptation in partially observable environments. Specifically, we integrate a shared-parameter MoE module into the agent networks. The Gram-Schmidt process is used to keep the expert subspaces independent, which facilitates the extraction of policy skills that transfer across diverse tasks. To improve the activation efficiency of the expert modules, we use sparsemax to produce sparse probability distributions, so that only a few relevant experts are active at once. Because partial observability induces an information bottleneck, we maximize the mutual information (MI) between local and global information. This objective is formalized as the optimization of a variational lower bound on the MI, which enhances decentralized agents’ ability to infer global state features from limited local observations. Experimental results in two typical competitive environments show that MOFA significantly outperforms multiple state-of-the-art algorithms in both multi-task learning and zero-shot generalization.
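To make the two building blocks named above concrete, the following is a minimal sketch in plain Python of the Gram-Schmidt process and of sparsemax gating. It is an illustration under stated assumptions, not the MOFA implementation: the function names `gram_schmidt` and `sparsemax`, the toy vectors, and the gate scores are all hypothetical, and how MOFA applies these operations inside its agent networks is not specified here.

```python
import math


def gram_schmidt(vectors, eps=1e-10):
    """Orthonormalize a list of vectors (modified Gram-Schmidt).

    Vectors that are (numerically) linearly dependent on earlier ones
    are dropped, so the result spans the same subspace as the input.
    """
    basis = []
    for v in vectors:
        w = [float(x) for x in v]
        # Remove the component of w along every basis vector found so far.
        for b in basis:
            coeff = sum(wi * bi for wi, bi in zip(w, b))
            w = [wi - coeff * bi for wi, bi in zip(w, b)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm > eps:
            basis.append([wi / norm for wi in w])
    return basis


def sparsemax(scores):
    """Euclidean projection of `scores` onto the probability simplex.

    Unlike softmax, the result can contain exact zeros, so only a few
    entries (here: experts) receive non-zero weight.
    """
    z_sorted = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, zk in enumerate(z_sorted, start=1):
        cumsum += zk
        # The support is the largest k for which this inequality holds.
        if 1.0 + k * zk > cumsum:
            tau = (cumsum - 1.0) / k
    return [max(zi - tau, 0.0) for zi in scores]


# Hypothetical expert directions; the third is the sum of the first two,
# so it adds no new direction and is dropped.
expert_basis = gram_schmidt([[1, 1, 0], [1, 0, 1], [2, 1, 1]])

# Hypothetical gate scores for three experts.
gate = sparsemax([2.0, 1.0, 0.1])  # -> [1.0, 0.0, 0.0]
```

With closer scores the gate spreads weight over several experts while still zeroing out the weakest ones; for example, `sparsemax([1.0, 0.8, 0.1])` gives non-zero weight to the first two experts only.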
Fu et al. studied this question.