To meet the pressing need in Human-Robot Collaboration (HRC) for real-time, accurate prediction of interaction behavior, this paper proposes a spatio-temporal multimodal prediction framework based on a large generative model. The method treats human-robot interaction as a sequence generation task: a Transformer backbone jointly models multimodal inputs such as vision, language, and force/joint state, and an autoregressive multi-head spatio-temporal cross-attention mechanism generates future human and robot behavior sequences. To improve dynamic adaptability, contrastive learning is introduced to make the predictive representations more discriminative, and proximal policy optimization (PPO) fine-tunes the policy network online using prediction error as the reward signal. Attention-weight visualization and Monte Carlo Dropout uncertainty quantification make the decision process interpretable and the risk controllable. Experiments on the public dataset HRI Interaction and the self-built simulation environment HRC Sim show that, compared with the best existing baseline, the proposed method reduces average displacement error (ADE) and final displacement error (FDE) by 25.6% and 22.6%, respectively, and improves multimodal accuracy (MM Acc) by 28.8%. It also corrects its predictions quickly online under sudden disturbances, which supports its combined advantages in accuracy, robustness, and interpretability.
Dong et al. (Sun,) studied this question.
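For readers unfamiliar with the displacement metrics reported above, the following is a minimal sketch of how ADE and FDE are commonly computed for a single predicted trajectory. It assumes the usual Euclidean-distance definitions, since the abstract does not state the exact formulation used; the function names `displacement_errors` and `relative_reduction` and the toy trajectories are hypothetical and chosen only for illustration.

```python
import math


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for one trajectory.

    pred, truth: equal-length sequences of points (tuples of coordinates).
    Assumes Euclidean distance per time step (an assumption, not taken
    from the paper).
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("trajectories must be non-empty and of equal length")
    # Per-step Euclidean distance between predicted and true positions.
    dists = [math.dist(p, t) for p, t in zip(pred, truth)]
    ade = sum(dists) / len(dists)  # mean error over all time steps
    fde = dists[-1]                # error at the final time step
    return ade, fde


def relative_reduction(baseline, ours):
    """Percentage reduction of an error metric relative to a baseline."""
    return 100.0 * (baseline - ours) / baseline


# Toy example (hypothetical data): prediction drifts away from the truth.
pred = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
truth = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
ade, fde = displacement_errors(pred, truth)
```

Under this reading, a reported "25.6% reduction in ADE" would correspond to `relative_reduction(baseline_ade, proposed_ade)` evaluating to 25.6, with both errors averaged over the test set.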