Abstract Autonomous vehicles and robots operate in dynamic environments that include complex urban streets, moving obstacles, and challenging sensing conditions, all of which make perception more difficult. A single type of sensor cannot meet the needs of target detection. Multimodal sensor fusion, which combines LiDAR and camera modalities, provides complementary 2D semantic and 3D geometric information. Its performance depends critically on precise extrinsic calibration between the sensors. We propose CMTNet, a novel cross-modal Transformer architecture for robust extrinsic parameter estimation. The method uses depth maps as a unified representation of images and LiDAR point clouds. We use a ResNet-18 network to extract relative depth and semantic features from the monocular depth map, and we extract precise 3D geometric features from the point cloud depth map. A correlation layer then fuses the two feature sets. Finally, the Transformer estimates accurate calibration parameters from the fused multimodal features. We evaluated our method on the KITTI raw dataset, where it outperformed other methods. In addition, extensive experiments on KITTI odometry demonstrated that our method generalizes well.
Sun et al. (Fri,) studied this question.
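To make the "depth map as a unified representation" idea in the abstract concrete, the following is a minimal sketch of how a LiDAR point cloud can be rendered as a sparse depth map under a given extrinsic estimate. It is not the CMTNet implementation: the function name `lidar_to_depth_map`, the 4x4 transform `T_cam_lidar`, the 3x3 intrinsic matrix `K`, and the image size are all hypothetical names and assumptions chosen for illustration, and a standard pinhole camera model is assumed.

```python
import numpy as np

def lidar_to_depth_map(points, T_cam_lidar, K, height, width):
    """Project LiDAR points (N, 3) into a sparse depth map (height, width).

    T_cam_lidar (4x4 extrinsic) and K (3x3 intrinsics) are illustrative
    inputs; pixels that receive no LiDAR return are left at 0.
    """
    # Homogeneous coordinates, then transform into the camera frame.
    ones = np.ones((points.shape[0], 1))
    pts_cam = (T_cam_lidar @ np.hstack([points, ones]).T).T[:, :3]

    # Discard points behind (or exactly on) the image plane.
    pts_cam = pts_cam[pts_cam[:, 2] > 0]
    z = pts_cam[:, 2]

    # Pinhole projection to integer pixel coordinates.
    uvw = (K @ pts_cam.T).T
    u = np.round(uvw[:, 0] / z).astype(int)
    v = np.round(uvw[:, 1] / z).astype(int)

    # Keep only projections that land inside the image.
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # When several points hit one pixel, keep the nearest one.
    depth = np.full((height, width), np.inf, dtype=np.float64)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0
    return depth
```

Under these assumptions, an error in `T_cam_lidar` shifts the projected depths relative to the image content, which is the kind of misalignment an extrinsic calibration method has to detect and correct.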