We propose MTC-BEV, a multi-modal 3D object detection framework for autonomous driving that combines spatial, temporal, and semantic cues for robust and efficient perception. MTC-BEV integrates image and LiDAR features in Bird’s-Eye View (BEV) space, where the heterogeneous modalities are aligned and fused by the Bidirectional Cross-Modal Attention Fusion (BCAP) module with positional encodings. To model temporal consistency, the Temporal Fusion (TTFusion) module explicitly compensates for ego-motion and incorporates past BEV features. In addition, a segmentation-guided BEV enhancement projects 2D instance masks into BEV space to highlight semantically informative regions. On the nuScenes dataset, MTC-BEV achieves a nuScenes Detection Score (NDS) of 72.4% at 14.91 FPS, a favorable balance between accuracy and efficiency. These results support the proposed design and indicate the potential of semantically guided cross-modal and temporal fusion for robust 3D object detection in autonomous driving.
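The ego-motion compensation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the TTFusion module itself, whose details the abstract does not give. It assumes a planar rigid motion, a square-cell BEV grid with the ego vehicle at the grid centre, and nearest-neighbour sampling. The function name `warp_prev_bev` and its parameters (`yaw`, `t`, `res`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def warp_prev_bev(prev_bev, yaw, t, res):
    """Resample a past BEV feature map into the current ego frame.

    Assumptions (not taken from the paper):
      prev_bev : array of shape (C, H, W), ego vehicle at the grid centre
      yaw, t   : pose of the current ego frame expressed in the previous
                 ego frame (rotation in radians, translation in metres)
      res      : cell size in metres; rows index y, columns index x
    """
    C, H, W = prev_bev.shape
    ii, jj = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")

    # Metric coordinates of each cell centre in the current ego frame.
    x = (jj - (W - 1) / 2.0) * res
    y = (ii - (H - 1) / 2.0) * res

    # Map those points into the previous ego frame: p_prev = R(yaw) p_cur + t.
    c, s = np.cos(yaw), np.sin(yaw)
    x_prev = c * x - s * y + t[0]
    y_prev = s * x + c * y + t[1]

    # Nearest source cell in the previous grid.
    j_src = np.rint(x_prev / res + (W - 1) / 2.0).astype(int)
    i_src = np.rint(y_prev / res + (H - 1) / 2.0).astype(int)
    valid = (i_src >= 0) & (i_src < H) & (j_src >= 0) & (j_src < W)

    # Cells that fall outside the previous grid stay zero.
    out = np.zeros_like(prev_bev)
    out[:, valid] = prev_bev[:, i_src[valid], j_src[valid]]
    return out
```

Once aligned this way, the past feature map shares a coordinate frame with the current one, so the two can be combined by whatever fusion operator a model uses. A practical implementation would more likely use bilinear sampling on the GPU and full 3D poses; the nearest-neighbour, 2D version here only shows the geometry.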
Qiankai Xi
Li Ma
Jikai Zhang
World Electric Vehicle Journal
Inner Mongolia University of Science and Technology
Shanghai Business School
Xi et al. studied this question.
www.synapsesocial.com/papers/68bb5f266d6d5674bcd02fdc — DOI: https://doi.org/10.3390/wevj16090493