Existing single-modality and simple fusion methods struggle with "long-tail" scenarios such as construction detours and traffic police gestures, and they lack effective modeling of the dynamic interaction intentions among multiple agents. This paper proposes the Cross-Modal Dynamic Intent Graph (CMDIG), a multi-modal deep modeling framework built on three innovations. First, a hierarchical attention fusion mechanism uses Local Self-Attention (LSA) to align the heterogeneous features of agents and Cross-Modal Attention (CMA) to capture the interactions between agents. Second, a dynamic Intent Query Pair (IQP) combines a static query learned from historical patterns with a dynamic query driven by real-time scene perception, so as to cover the multimodal uncertainty of trajectories. Third, Implicit Scene Alignment (ISA), a self-supervised learning objective, constrains multimodal features to follow a consistent distribution in the latent space, which improves the model's generalization to unlabeled extreme scenarios. Experiments on the Argoverse 2 and nuScenes datasets show that CMDIG outperforms existing baseline methods on minADE, minFDE, Miss Rate, and brier-minFDE, especially in out-of-distribution tests involving emergency vehicles, which supports its robustness and accuracy in complex interactive environments.
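For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how they are commonly computed for a single agent with K predicted trajectories. It is not the evaluation code used in this work. The function name `trajectory_metrics`, the 2.0 m miss threshold, and the form of brier-minFDE (minFDE plus the squared complement of the probability assigned to the best mode) are assumptions based on the conventions commonly used with Argoverse-style benchmarks.

```python
import math


def displacement(p, q):
    """Euclidean distance between two (x, y) points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def trajectory_metrics(predictions, probabilities, ground_truth, miss_threshold=2.0):
    """Compute per-agent metrics for K predicted modes (hypothetical helper).

    predictions:   list of K trajectories, each a list of (x, y) points
    probabilities: list of K mode probabilities
    ground_truth:  list of (x, y) points, same length as each trajectory
    miss_threshold: endpoint error in metres above which the agent counts
                    as missed (2.0 is an assumed, commonly used value)
    """
    # Average displacement error of each mode over all time steps.
    ades = [
        sum(displacement(p, g) for p, g in zip(traj, ground_truth)) / len(ground_truth)
        for traj in predictions
    ]
    # Final displacement error of each mode (last time step only).
    fdes = [displacement(traj[-1], ground_truth[-1]) for traj in predictions]

    # The "best" mode is the one with the smallest endpoint error.
    best = min(range(len(fdes)), key=fdes.__getitem__)
    min_fde = fdes[best]

    return {
        "minADE": min(ades),
        "minFDE": min_fde,
        # Miss Rate is the mean of this flag over all evaluated agents.
        "missed": min_fde > miss_threshold,
        # Penalise low confidence in the best mode.
        "brier-minFDE": min_fde + (1.0 - probabilities[best]) ** 2,
    }


if __name__ == "__main__":
    gt = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
    preds = [
        [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],  # matches the ground truth
        [(0.0, 3.0), (1.0, 3.0), (2.0, 3.0)],  # offset by 3 m throughout
    ]
    print(trajectory_metrics(preds, [0.5, 0.5], gt))
```

Averaging `minADE`, `minFDE`, and `brier-minFDE` over all agents, and taking the fraction of agents with `missed` set, gives dataset-level scores under these assumed definitions.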
Zhou et al. (Sun,) studied this question.