August 15, 2025

CMTNet: A Transformer-based Network for LiDAR-Camera Cross-Modal Calibration

Key Points

CMTNet targets accurate LiDAR-camera extrinsic calibration, on which multimodal sensor fusion for target detection depends.
On the KITTI raw dataset, the model outperformed the other methods it was compared with.
A correlation layer fuses relative depth and semantic features from the camera depth map with 3D geometric features from the LiDAR depth map, and a Transformer estimates the calibration parameters from the fused features.
Additional experiments on KITTI odometry indicate that the method generalizes well.

Abstract

Autonomous vehicles and robots operate in dynamic environments that include complex urban streets, moving obstacles, and difficult sensing conditions, all of which make perception more challenging. A single type of sensor cannot meet the needs of target detection on its own. Multimodal sensor fusion, which combines the LiDAR and camera modalities, provides complementary 2D semantic and 3D geometric information. Its performance depends critically on precise extrinsic calibration between the sensors. We propose CMTNet, a novel cross-modal Transformer architecture for robust estimation of the extrinsic parameters. The method uses depth maps as a unified representation of images and LiDAR point clouds. We use a ResNet-18 network to extract relative depth and semantic features from the monocular depth map, and we extract precise 3D geometric features from the point cloud depth map. A correlation layer then fuses the two sets of features. Finally, the Transformer estimates the calibration parameters from the fused multimodal features. We evaluated our method on the KITTI raw dataset, where it outperformed other methods. In addition, extensive experiments on KITTI odometry demonstrated that our method generalizes well.
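
The abstract does not spell out how the point cloud depth map is built. The sketch below shows one common way to do it, under assumptions that are not taken from the paper: a pinhole camera with intrinsic matrix K, a 4x4 LiDAR-to-camera transform T, and a hypothetical helper name lidar_to_depth_map. It is an illustration of the depth-map representation, not the authors' implementation.

```python
import numpy as np

def lidar_to_depth_map(points, T, K, height, width):
    """Project N x 3 LiDAR points into a sparse depth map (hypothetical helper).

    points: (N, 3) array in the LiDAR frame.
    T:      (4, 4) LiDAR-to-camera transform (assumed extrinsic estimate).
    K:      (3, 3) pinhole camera intrinsics (assumed).
    """
    points = np.asarray(points, dtype=np.float64)
    # Homogeneous coordinates, then move the points into the camera frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (T @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    cam = cam[cam[:, 2] > 0]
    # Pinhole projection to pixel coordinates.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = cam[:, 2]
    # Discard projections that fall outside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]
    # When several points hit one pixel, keep the nearest one.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels with no LiDAR return
    return depth

# Toy usage with made-up intrinsics and an identity extrinsic.
K = np.array([[100.0, 0.0, 50.0],
              [0.0, 100.0, 50.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
points = np.array([
    [0.0, 0.0, 10.0],   # projects to the image centre
    [0.0, 0.0, 5.0],    # same pixel, nearer, so it wins
    [0.0, 0.0, -1.0],   # behind the camera, dropped
    [10.0, 0.0, 1.0],   # projects outside the image, dropped
    [0.1, 0.2, 1.0],    # a second valid pixel
])
depth = lidar_to_depth_map(points, T, K, height=100, width=100)
```

Because the projection depends on T, an error in the extrinsic parameters shifts the LiDAR depth map relative to the camera depth map, which is the kind of misalignment a calibration network can learn to correct.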
