Modern warehouse dispatch systems require high throughput and flexibility while maintaining safe and reliable operation. However, traditional conveyor-based automation and state-machine-based control methods struggle to handle fragile-item sorting and dynamic layouts, often resulting in dispatch errors during long-term operation. To address these challenges, this paper proposes a heterogeneous embodied multi-agent cooperation framework for parcel dispatching and sorting tasks. The proposed system integrates two quadruped robots and one wheeled dual-arm mobile manipulator. It formulates long-horizon collaborative dispatching as an observable multi-agent Markov decision process, with agent policies initialized through imitation learning. A reinforcement learning fine-tuning framework is then developed that incorporates reward shaping, curriculum learning, and safety filtering. Variance-Suppressed Policy Optimization (VSPO) and Reinforced Advantage Decision (ReAd) mechanisms are also introduced to improve policy optimization efficiency, coordination stability, and decision reliability. Experimental results show that the proposed method achieves a success rate of 77.7%, a categorization accuracy of 81.9%, and a throughput of 1.10 parcels per minute across the L1 and L2 scenarios, together with a non-collision rate of 82.6% and a right-dispatch rate of 84.7%. Comparative and ablation analyses further show clear advantages over the baseline. For next-generation logistics scenarios, these results indicate the potential of the proposed heterogeneous multi-robot system for handling complex, long-horizon collaboration.
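The fine-tuning stage above names reward shaping, curriculum learning, and safety filtering without giving details. The sketch below shows one common way to realize each component. It assumes potential-based shaping (r + γΦ(s′) − Φ(s)), a separation-based action filter, and a curriculum gated on success rate. The names `State`, `potential`, `shaped_reward`, `safety_filter`, and `next_stage`, along with every constant, are hypothetical illustrations rather than the paper's implementation. VSPO and ReAd are not modeled.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical per-agent state: distance (m) to the target drop-off bin
# and distance (m) to the nearest other robot.
@dataclass(frozen=True)
class State:
    dist_to_goal: float
    dist_to_nearest_agent: float

GAMMA = 0.99          # discount factor (assumed value)
MIN_SEPARATION = 0.5  # metres; hypothetical safety margin

def potential(s: State) -> float:
    """Hypothetical potential: being closer to the goal bin scores higher."""
    return -s.dist_to_goal

def shaped_reward(r: float, s: State, s_next: State, gamma: float = GAMMA) -> float:
    """Potential-based shaping: add gamma * Phi(s') - Phi(s) to the task reward."""
    return r + gamma * potential(s_next) - potential(s)

def safety_filter(
    proposed: str,
    predict: Callable[[str], State],
    fallback: str = "stop",
    min_sep: float = MIN_SEPARATION,
) -> str:
    """Keep the proposed action only if its predicted next state preserves
    the minimum separation; otherwise substitute the fallback action."""
    if predict(proposed).dist_to_nearest_agent >= min_sep:
        return proposed
    return fallback

def next_stage(stage: int, success_rate: float,
               threshold: float = 0.8, max_stage: int = 2) -> int:
    """Advance to a harder curriculum stage once the success rate reaches
    the threshold; never go past the final stage."""
    if success_rate >= threshold:
        return min(stage + 1, max_stage)
    return stage

if __name__ == "__main__":
    s, s_next = State(2.0, 1.0), State(1.5, 1.0)
    print(shaped_reward(0.0, s, s_next))  # moving 0.5 m closer yields a positive bonus

    # Hypothetical one-step predictor: "advance" brings another robot too close.
    predict = lambda a: State(1.0, 0.2) if a == "advance" else State(2.0, 1.0)
    print(safety_filter("advance", predict), safety_filter("turn", predict))
    print(next_stage(0, 0.85), next_stage(0, 0.5), next_stage(2, 0.9))
```

Potential-based shaping is used here because it densifies the reward without changing the optimal policy. Filtering actions before execution keeps unsafe moves out of both deployment and the collected training data.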
Zhang et al. (Sun,) studied this question.