Multi-modal 3D object detection, which leverages the complementary strengths of LiDAR point clouds and camera RGB images, has emerged as a critical component of 3D perception in autonomous driving. A central challenge in multi-modal learning is modality alignment, which aims to establish accurate semantic correspondences across distinct modalities. However, existing methods struggle to achieve robust alignment when data from one modality is obscured, for example by object occlusion or by adverse environmental conditions such as illumination variations and inclement weather. To alleviate this issue, we present CG-MAE, a dual-branch Bird’s-Eye-View (BEV) masked autoencoder framework based on cross-modal guidance for 3D object detection in autonomous driving. Specifically, a cross-modal guided reconstruction module is developed to predict the representations of obscured objects in the BEV space, reducing the difficulty of modality alignment during multi-modal fusion. To mimic the obscuration caused by occlusion or adverse environments, we propose a ground-truth-based foreground masking strategy that covers up objects such as vehicles and pedestrians, thereby encouraging the reconstruction module to focus on modality alignment in the foreground regions with high information density. Because data from either modality can be obscured, we build a dual-branch BEV masked autoencoder that reconstructs data from both the camera and LiDAR modalities. Extensive experiments on the nuScenes dataset with camera-LiDAR inputs demonstrate that the proposed framework achieves superior performance over existing state-of-the-art multi-modal learning methods.
Huo et al. (Thu,) studied this question.
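
The ground-truth-based foreground masking idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the actual procedure, so everything below is an assumption made for illustration: the box format `(cx, cy, length, width, yaw)`, the function names `foreground_mask` and `apply_mask`, the rule that a BEV cell counts as foreground when its center lies inside a ground-truth box, and the use of a single scalar as the mask token.

```python
import math
from typing import List, Tuple

# Hypothetical box format (an assumption, not taken from the paper):
# (cx, cy, length, width, yaw), in metres and radians, in the BEV plane.
Box = Tuple[float, float, float, float, float]


def foreground_mask(
    boxes: List[Box],
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
    cell: float,
) -> List[List[bool]]:
    """Mark every BEV cell whose center falls inside a ground-truth box."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))
    mask = [[False] * nx for _ in range(ny)]
    for iy in range(ny):
        y = y_range[0] + (iy + 0.5) * cell  # cell-center y coordinate
        for ix in range(nx):
            x = x_range[0] + (ix + 0.5) * cell  # cell-center x coordinate
            for cx, cy, length, width, yaw in boxes:
                dx, dy = x - cx, y - cy
                # Rotate the offset into the box frame to handle yaw.
                local_x = dx * math.cos(yaw) + dy * math.sin(yaw)
                local_y = -dx * math.sin(yaw) + dy * math.cos(yaw)
                if abs(local_x) <= length / 2 and abs(local_y) <= width / 2:
                    mask[iy][ix] = True
                    break
    return mask


def apply_mask(
    features: List[List[float]],
    mask: List[List[bool]],
    mask_token: float = 0.0,
) -> List[List[float]]:
    """Replace masked (foreground) cells with a mask token value."""
    return [
        [mask_token if m else f for f, m in zip(f_row, m_row)]
        for f_row, m_row in zip(features, mask)
    ]
```

In a setup like the one the abstract outlines, a mask of this kind would presumably be applied to the BEV features of one branch so that the reconstruction module has to recover the hidden foreground cells with guidance from the other modality. A real implementation would operate on multi-channel feature tensors rather than the scalar grid used here.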