What question did this study set out to answer?

How can a novel multi-modal framework overcome the limitations of existing behavior prediction models for unmanned vehicles in complex traffic scenes?

April 1, 2026

Multi-modal depth modeling technology for behavior prediction of unmanned vehicles in complex traffic scenes

Key Points

Aimed to address the limitations of existing behavior prediction models for unmanned vehicles in complex traffic scenes with a novel multi-modal framework.
Developed a hierarchical attention fusion mechanism that combines Local Self-Attention (LSA) and Cross-Modal Attention (CMA).
Introduced a dynamic Intent Query Pair (IQP) that integrates historical patterns with real-time scene information.
Applied Implicit Scene Alignment (ISA), a self-supervised learning objective, to improve model generalization in extreme scenarios.
MDIG outperformed baseline methods on minADE, minFDE, and Miss Rate.
Demonstrated strong performance in scenarios involving emergency vehicles, indicating robustness in complex environments.
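
For readers unfamiliar with the metrics named above, the following is a minimal sketch of how minADE, minFDE, brier-minFDE, and Miss Rate are commonly computed for multi-mode trajectory forecasts. It is not the study's evaluation code. The function names, the trajectory layout (lists of (x, y) points), and the 2.0 m miss threshold are assumptions chosen for illustration; the threshold follows the convention commonly used on Argoverse-style benchmarks.

```python
import math

def displacement_errors(pred, truth):
    """Per-timestep Euclidean distance between one predicted mode and the ground truth."""
    return [math.hypot(px - tx, py - ty) for (px, py), (tx, ty) in zip(pred, truth)]

def min_ade(preds, truth):
    """Minimum over modes of the average displacement error."""
    return min(sum(e) / len(e) for e in (displacement_errors(p, truth) for p in preds))

def min_fde(preds, truth):
    """Minimum over modes of the final-timestep displacement error."""
    return min(displacement_errors(p, truth)[-1] for p in preds)

def brier_min_fde(preds, probs, truth):
    """minFDE plus a penalty (1 - p)^2, where p is the probability of the best-FDE mode."""
    fdes = [displacement_errors(p, truth)[-1] for p in preds]
    best = min(range(len(fdes)), key=fdes.__getitem__)
    return fdes[best] + (1.0 - probs[best]) ** 2

def miss_rate(scenarios, threshold=2.0):
    """Fraction of scenarios whose minFDE exceeds the threshold (assumed 2.0 m)."""
    misses = sum(min_fde(preds, truth) > threshold for preds, truth in scenarios)
    return misses / len(scenarios)

# Hypothetical toy data: one ground-truth path and two predicted modes.
truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
mode_a = [(0.0, 0.0), (1.0, 0.0), (2.0, 3.0)]  # accurate early, drifts at the end
mode_b = [(0.0, 2.0), (1.0, 2.0), (2.0, 1.0)]  # offset early, closer at the end
preds, probs = [mode_a, mode_b], [0.6, 0.4]

print(min_ade(preds, truth), min_fde(preds, truth), brier_min_fde(preds, probs, truth))
```

In this toy case the two metrics pick different modes: `mode_a` has the lower average error, while `mode_b` has the lower final error. This is why minADE and minFDE are usually reported together.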

Abstract

Existing single-modality and simple fusion methods struggle with "long-tail" scenes such as construction diversions and traffic police gestures, and they lack effective modeling of the dynamic interaction intentions of multiple agents. This paper proposes a multi-modal depth modeling framework, the Cross-Modal Dynamic Intent Graph (MDIG), built on three innovations. First, a hierarchical attention fusion mechanism uses Local Self-Attention (LSA) to align the heterogeneous features of agents and Cross-Modal Attention (CMA) to capture the interactions between agents. Second, a dynamic Intent Query Pair (IQP) combines a static query, learned from historical patterns, with a dynamic query driven by real-time scene awareness, in order to cover the multimodal uncertainty of future trajectories. Third, Implicit Scene Alignment (ISA), a self-supervised learning objective, constrains multimodal features to a consistent distribution in the latent space, improving the model's generalization to unlabeled extreme scenes. Experiments on the Argoverse 2 and nuScenes datasets show that MDIG outperforms existing baseline methods on minADE, minFDE, Miss Rate, and brier-minFDE, particularly in an out-of-distribution test involving emergency vehicles, which supports its robustness and accuracy in complex interactive environments.
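
The two-stage attention idea described in the abstract can be illustrated with a small, dependency-free sketch. This is not the MDIG architecture: the summary gives no layer details, so the token layout, the mean pooling step, the absence of learned projections, and all function names here are assumptions made purely for illustration. Stage 1 applies self-attention within each agent's own feature tokens, and stage 2 lets the resulting agent embeddings attend to one another.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def mean_pool(vectors):
    """Average a list of vectors into a single vector."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def hierarchical_fusion(agent_tokens):
    """agent_tokens: one list of feature vectors per agent (hypothetical layout)."""
    # Stage 1 (local): self-attention within each agent's own tokens, then pool.
    agent_embeddings = [mean_pool(attention(t, t, t)) for t in agent_tokens]
    # Stage 2 (cross): each agent embedding attends to all agent embeddings.
    return attention(agent_embeddings, agent_embeddings, agent_embeddings)

# Hypothetical scene: agent 0 has two feature tokens, agent 1 has one.
scene = [[[1.0, 0.0], [0.0, 1.0]], [[2.0, 2.0]]]
fused = hierarchical_fusion(scene)
print(fused)
```

A real model would add learned query, key, and value projections, multiple heads, and positional or modality encodings. The sketch only shows the order of the two stages, local first and then across agents.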
