We propose MTC-BEV, a multi-modal 3D object detection framework for autonomous driving that combines spatial, temporal, and semantic cues for robust and efficient perception. MTC-BEV integrates image and LiDAR features in Bird's-Eye View (BEV) space, where the heterogeneous modalities are aligned and fused by the Bidirectional Cross-Modal Attention Fusion (BCAP) module with positional encodings. To model temporal consistency, the Temporal Fusion (TTFusion) module explicitly compensates for ego-motion and incorporates past BEV features. In addition, a segmentation-guided BEV enhancement projects 2D instance masks into BEV space to highlight semantically informative regions. On the nuScenes dataset, MTC-BEV achieves a nuScenes Detection Score (NDS) of 72.4% at 14.91 FPS, a favorable balance between accuracy and efficiency. These results support the proposed design and highlight the potential of semantic-guided cross-modal and temporal fusion for 3D object detection in autonomous driving.
Xi et al. (Mon,) studied this question.
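The ego-motion compensation mentioned for temporal fusion can be illustrated with a minimal sketch. It is not the TTFusion implementation, which the abstract does not specify. It assumes planar rigid motion, a BEV grid with the ego vehicle at its centre, x along columns and y along rows, and nearest-neighbour sampling. The function name `warp_prev_bev` and all of its parameters are hypothetical.

```python
import math


def warp_prev_bev(prev_bev, dx, dy, yaw, cell_size=1.0, fill=0.0):
    """Resample a previous-frame BEV grid into the current ego frame.

    prev_bev: H x W grid (list of lists), ego vehicle at the grid centre.
    (dx, dy, yaw): pose of the current ego frame expressed in the previous
    ego frame, in metres and radians (assumed convention).
    """
    h, w = len(prev_bev), len(prev_bev[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            # Metric coordinates of this cell in the current ego frame.
            x_cur = (c - cx) * cell_size
            y_cur = (r - cy) * cell_size
            # Same physical point in the previous frame: p_prev = R(yaw) p_cur + t.
            x_prev = cos_y * x_cur - sin_y * y_cur + dx
            y_prev = sin_y * x_cur + cos_y * y_cur + dy
            # Nearest-neighbour lookup in the previous grid.
            src_c = round(x_prev / cell_size + cx)
            src_r = round(y_prev / cell_size + cy)
            if 0 <= src_r < h and 0 <= src_c < w:
                out[r][c] = prev_bev[src_r][src_c]
    return out


# Usage: an object one cell ahead of the ego vehicle in the previous frame.
prev = [[0.0] * 5 for _ in range(5)]
prev[2][3] = 1.0
# The ego vehicle advances one cell along x, so the object now sits at the centre.
aligned = warp_prev_bev(prev, dx=1.0, dy=0.0, yaw=0.0)
```

After this alignment, past and current BEV features refer to the same physical locations and can be concatenated or attended over. A learned pipeline would more likely use bilinear sampling on feature tensors; nearest-neighbour lookup is used here only to keep the sketch dependency-free.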