Key points are not available for this paper at this time.
In autonomous driving, LiDAR point-clouds and RGB images are two major data modalities with complementary cues for 3D object detection. However, it is quite difficult to sufficiently use them, due to large inter-modal discrepancies. To address this issue, we propose a novel framework, namely Contrastively Augmented Transformer for multi-modal 3D object Detection (CAT-Det). Specifically, CAT-Det adopts a two-stream structure consisting of a Pointformer (PT) branch, an Imageformer (IT) branch along with a Cross-Modal Transformer (CMT) module. PT, IT and CMT jointly encode intra-modal and inter-modal long-range contexts for representing an object, thus fully exploring multi-modal information for detection. Furthermore, we propose an effective One-way Multimodal Data Augmentation (OMDA) approach via hierarchical contrastive learning at both the point and object levels, significantly improving the accuracy only by augmenting point-clouds, which is free from complex generation of paired samples of the two modalities. Extensive experiments on the KITTI benchmark show that CAT-Det achieves a new state-of-the-art, highlighting its effectiveness.
Building similarity graph...
Analyzing shared references across papers
Loading...
Yanan Zhang
Jiaxin Chen
Di Huang
Beihang University
State Key Laboratory of Software Development Environment
Building similarity graph...
Analyzing shared references across papers
Loading...
Zhang et al. (Wed,) studied this question.
www.synapsesocial.com/papers/69dd506bfb7610310c101fea — DOI: https://doi.org/10.1109/cvpr52688.2022.00098