Collaborative control of multiple surface vessels remains a significant challenge in autonomous maritime operations, particularly in environments with sparse rewards. In such scenarios, conventional Multi-Agent Proximal Policy Optimization (MAPPO) often suffers from inefficient credit assignment and slow convergence. To address these limitations, this paper proposes an enhanced MAPPO framework that integrates a counterfactual baseline, derived from Counterfactual Multi-Agent Policy Gradients (CMAPG), into the Generalized Advantage Estimation (GAE) formulation, and adds a Prioritized Experience Replay (PER) mechanism with importance sampling. The counterfactual baseline isolates each agent's individual contribution and so provides precise, agent-specific learning signals within the on-policy paradigm, directly addressing the credit assignment problem. The PER mechanism improves sample efficiency by reusing valuable past experiences, with importance sampling applied so that this reuse does not compromise training stability. Simulation results and comparisons validate the improved control performance of the proposed controller.
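To make the two ingredients concrete, the following is a minimal NumPy sketch, not the paper's implementation. It assumes that the counterfactual baseline is the policy-weighted expectation of a centralized critic's action values over one agent's own actions (with the other agents' actions held fixed), that this baseline simply replaces the state value in the GAE temporal-difference residual, and that PER follows the usual proportional scheme with exponents alpha and beta. All function names, array shapes, and default hyperparameters here are hypothetical and chosen only for illustration.

```python
import numpy as np


def counterfactual_baseline(q_values, policy_probs):
    """Expected Q over one agent's own actions, other agents' actions fixed.

    q_values:     (T, A) critic outputs Q(s_t, (u_t^{-i}, a)) for each action a
    policy_probs: (T, A) agent i's policy probabilities pi_i(a | o_t)
    Returns a (T,) array of per-step baselines. (Assumed form.)
    """
    q_values = np.asarray(q_values, dtype=float)
    policy_probs = np.asarray(policy_probs, dtype=float)
    return np.sum(policy_probs * q_values, axis=1)


def gae_with_baseline(rewards, baselines, bootstrap, gamma=0.99, lam=0.95):
    """GAE in which the baseline b_t stands in for V(s_t) (an assumption).

    rewards:   (T,) per-step rewards
    baselines: (T,) baselines for steps 0..T-1
    bootstrap: scalar baseline for the state after the last step
    """
    rewards = np.asarray(rewards, dtype=float)
    baselines = np.asarray(baselines, dtype=float)
    next_b = np.append(baselines[1:], bootstrap)
    # TD residual: delta_t = r_t + gamma * b_{t+1} - b_t
    deltas = rewards + gamma * next_b - baselines
    advantages = np.zeros_like(deltas)
    running = 0.0
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(deltas))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    return advantages


def per_sample(priorities, batch_size, alpha=0.6, beta=0.4, rng=None):
    """Proportional prioritized sampling with importance-sampling weights."""
    rng = np.random.default_rng() if rng is None else rng
    scaled = np.asarray(priorities, dtype=float) ** alpha
    probs = scaled / scaled.sum()
    idx = rng.choice(len(probs), size=batch_size, p=probs)
    # w_j = (N * P(j))^(-beta), normalized by the largest weight in the batch
    weights = (len(probs) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return idx, weights


if __name__ == "__main__":
    # Toy sparse-reward episode: reward arrives only at the final step.
    q = np.array([[0.1, 0.3], [0.2, 0.6], [0.5, 1.0]])
    pi = np.array([[0.5, 0.5], [0.4, 0.6], [0.2, 0.8]])
    b = counterfactual_baseline(q, pi)
    adv = gae_with_baseline([0.0, 0.0, 1.0], b, bootstrap=0.0)
    # Using |advantage| as the replay priority is an illustrative choice.
    idx, w = per_sample(np.abs(adv) + 1e-6, batch_size=2)
    print(adv, idx, w)
```

In a training loop along these lines, the weights returned by `per_sample` would multiply each sampled transition's loss term, so that frequently replayed high-priority samples do not dominate the gradient.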
Wang et al. (Sat,) studied this question.