Q-value Mixing (QMIX) is a widely used algorithm for multi-agent reinforcement learning. However, multi-agent environments are complex and have high-dimensional state and action spaces, which lead to low exploration efficiency and sparse global rewards in the early stages of QMIX training. To address this issue, we propose DE-QMIX, an efficient method for optimizing the hypernetwork parameters of QMIX based on differential evolution (DE). DE-QMIX encodes the hypernetwork parameters of QMIX as individuals in a population and obtains the best hypernetwork parameters by applying mutation, crossover, and selection operations to these individuals. The hypernetworks also adjust their parameters through gradient descent and feed the updated parameters back into the current population, which improves the overall efficiency of DE-QMIX. With optimized hypernetwork parameters, the joint action-value function $Q_{total}$ fitted by the mixing network more accurately reflects the decision quality of the entire multi-agent system. This guides individual agents to avoid invalid or inefficient actions during exploration and speeds up their learning. The improved $Q_{total}$ in turn guides the agents to choose better actions and increases the global reward. Our experiments on the StarCraft Multi-Agent Challenge (SMAC) platform demonstrate that DE-QMIX achieves a higher average winning rate and global reward than QMIX and other existing approaches such as Multi-Agent Variational Exploration (MAVEN), Value-Decomposition Networks (VDN), and Joint Q-Function Transformation (QTRAN).
Cao et al. studied this question.
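The mutation, crossover, and selection operations mentioned in the abstract can be illustrated with a minimal sketch of one common differential evolution variant (DE/rand/1/bin). This is not the authors' implementation: the function names `de_generation` and `toy_fitness`, the scale factor `f`, the crossover rate `cr`, and the population size are all assumptions chosen for illustration. Each individual is a flat list of floats standing in for flattened hypernetwork parameters, and the quadratic `toy_fitness` is a placeholder for whatever fitness signal DE-QMIX actually uses, which the abstract does not specify. The gradient-descent feedback into the population is not modeled here.

```python
import random


def de_generation(population, fitness_fn, rng, f=0.5, cr=0.9):
    """Run one DE/rand/1/bin generation that maximizes fitness_fn.

    population: list of equal-length lists of floats (hypothetical flattened
    hypernetwork parameters). f and cr are assumed default values.
    """
    dim = len(population[0])
    next_population = []
    for i, target in enumerate(population):
        # Mutation: combine three distinct individuals other than the target.
        others = [j for j in range(len(population)) if j != i]
        a, b, c = (population[j] for j in rng.sample(others, 3))
        mutant = [a[k] + f * (b[k] - c[k]) for k in range(dim)]

        # Binomial crossover: take each gene from the mutant with probability
        # cr, and force one gene so the trial always differs from the target.
        forced = rng.randrange(dim)
        trial = [
            mutant[k] if (rng.random() < cr or k == forced) else target[k]
            for k in range(dim)
        ]

        # Greedy selection: keep the trial only if it is at least as fit.
        if fitness_fn(trial) >= fitness_fn(target):
            next_population.append(trial)
        else:
            next_population.append(target)
    return next_population


def toy_fitness(theta):
    """Placeholder fitness (assumption): peak at theta = (1, ..., 1)."""
    return -sum((x - 1.0) ** 2 for x in theta)


rng = random.Random(0)
population = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(8)]
initial_best = max(toy_fitness(ind) for ind in population)

for _ in range(50):
    population = de_generation(population, toy_fitness, rng)

final_best = max(toy_fitness(ind) for ind in population)
```

Because selection is greedy, no individual's fitness can decrease from one generation to the next, so `final_best` is never worse than `initial_best`. In a setting like the one the abstract describes, the fitness evaluation would presumably involve running or scoring the QMIX mixing network with each candidate parameter vector, which is far more expensive than this toy function.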