What question did this study set out to answer?

The study aims to improve 3D object detection by enhancing the alignment between LiDAR and camera data under challenging conditions.

June 17, 2026 | Open Access

CG-MAE: BEV Masked Autoencoders Based on Cross-Modal Guidance for 3D Object Detection in Autonomous Driving

Key Points

The study aims to improve 3D object detection by enhancing the alignment between LiDAR and camera data under challenging conditions.
Developed a dual-branch BEV masked autoencoder framework for 3D object detection.
Implemented a cross-modal guided reconstruction module to aid modality alignment.
Employed a ground truth-based foreground masking strategy to simulate object obscuration.
CG-MAE outperforms existing state-of-the-art multi-modal learning methods in extensive experiments on the nuScenes dataset.
Achieved improved modality alignment and reconstruction accuracy under challenging environmental conditions.

Abstract

Multi-modal 3D object detection, which leverages the complementary strengths of LiDAR point clouds and camera RGB images, has emerged as a critical component of 3D perception in autonomous driving. Modality alignment, a central challenge in multi-modal learning, aims to establish accurate semantic correspondences across distinct modalities. However, existing methods struggle to achieve robust alignment when data from one modality is obscured, for example by object occlusion or by adverse environmental conditions such as illumination variations and inclement weather. To alleviate this issue, we present CG-MAE, a dual-branch Bird’s-Eye-View (BEV) masked autoencoder framework based on cross-modal guidance for 3D object detection in autonomous driving. Specifically, a cross-modal guided reconstruction module is developed to predict the representations of obscured objects in the BEV space, which eases modality alignment during multi-modal fusion. To mimic the object obscuration caused by occlusion or adverse environments, we propose a ground truth-based foreground masking strategy that covers up objects such as vehicles and pedestrians, thereby encouraging the reconstruction module to focus on modality alignment in foreground regions with high information density. Because data from either modality can be obscured, we build a dual-branch BEV masked autoencoder that reconstructs data from both the camera and LiDAR modalities. Extensive experiments on the nuScenes dataset with camera-LiDAR inputs demonstrate that the proposed framework achieves superior performance over existing state-of-the-art multi-modal learning methods.
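
The ground truth-based foreground masking idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the box format `[cx, cy, length, width, yaw]`, the function names `foreground_mask` and `apply_mask`, the grid parameters, and the use of a single mask token vector are all assumptions made for illustration. The sketch marks every BEV cell whose centre falls inside a ground-truth box footprint and replaces the features at those cells.

```python
import numpy as np


def foreground_mask(boxes, grid_shape, cell_size, origin):
    """Boolean BEV mask, True where a cell centre lies inside any box footprint.

    boxes: (N, 5) array of [cx, cy, length, width, yaw] in metres / radians
           (hypothetical format chosen for this sketch).
    grid_shape: (H, W) number of cells; origin: (x_min, y_min) in metres.
    """
    h, w = grid_shape
    # Cell-centre coordinates in metres.
    xs = origin[0] + (np.arange(w) + 0.5) * cell_size
    ys = origin[1] + (np.arange(h) + 0.5) * cell_size
    gx, gy = np.meshgrid(xs, ys)  # each of shape (H, W)

    mask = np.zeros((h, w), dtype=bool)
    for cx, cy, length, width, yaw in np.asarray(boxes, dtype=float).reshape(-1, 5):
        # Rotate cell centres into the box frame, then test against half-extents.
        dx, dy = gx - cx, gy - cy
        c, s = np.cos(yaw), np.sin(yaw)
        local_x = c * dx + s * dy
        local_y = -s * dx + c * dy
        mask |= (np.abs(local_x) <= length / 2) & (np.abs(local_y) <= width / 2)
    return mask


def apply_mask(bev, mask, mask_token):
    """Return a copy of bev (C, H, W) with masked cells set to mask_token (C,)."""
    out = bev.copy()
    out[:, mask] = mask_token[:, None]
    return out


# Example: a 4 x 4 grid of 1 m cells with one 2 m x 2 m box centred at (2, 2).
mask = foreground_mask([[2.0, 2.0, 2.0, 2.0, 0.0]], (4, 4), 1.0, (0.0, 0.0))
masked = apply_mask(np.ones((3, 4, 4)), mask, np.zeros(3))
```

In a dual-branch setup of the kind the abstract describes, the same mask could in principle be applied to the camera BEV features and the LiDAR BEV features separately, so that each branch has foreground regions to reconstruct. How CG-MAE actually selects and applies the mask is specified in the full paper, not here.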
