Recent advances in LiDAR representation learning with limited annotations show strong promise. The best-performing existing methods mainly distill 2D representations into 3D representations via superpixels. However, using superpixels to construct the cross-modal contrastive learning objective leads to semantic ambiguity among 3D features that belong to the same object, which impairs performance. To address this, we leverage unlabeled LiDAR-camera pairs to design a novel pre-training pipeline that learns directly from category space and pulls 3D features belonging to the same object close together. Specifically, we obtain auto-labeled 2D object boxes with a fixed 2D open-vocabulary object detector and transform the labeled boxes into high-quality pixel-wise label maps with a box-to-label-maps generation algorithm. Based on these pseudo labels, we present a dual-space pre-training 3D network that recognizes accurate categories from the semantic priors of paired 3D points and segments complete objects. Furthermore, we propose a module named AdaptPro to further improve performance when fine-tuning the 3D network under limited annotations; it uses category prototypes to exploit the unpaired 3D features that lack 2D correspondences. Experimental results show that our method achieves state-of-the-art performance on both the nuScenes and SemanticKITTI benchmark datasets. Code is available at https://github.com/dengq7/Box4Scene.
Deng et al. studied this question.
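To make the box-to-label-map idea in the abstract concrete, the following is a minimal sketch of how 2D detection boxes could be rasterized into a pixel-wise label map and then looked up for projected LiDAR points. It is not the authors' box-to-label-maps generation algorithm, which the abstract does not specify. The names `boxes_to_label_map`, `labels_for_points`, and `IGNORE`, the box tuple layout, and the rule that smaller boxes overwrite larger ones where they overlap are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box layout: (x_min, y_min, x_max, y_max, class_id),
# in pixel coordinates with exclusive maxima.
Box = Tuple[int, int, int, int, int]

IGNORE = -1  # assumed label for pixels that no box covers


def boxes_to_label_map(boxes: List[Box], height: int, width: int) -> List[List[int]]:
    """Rasterize boxes into a height x width map of class ids."""
    label_map = [[IGNORE] * width for _ in range(height)]
    # Assumption: paint large boxes first so that smaller boxes,
    # which are more likely foreground objects, overwrite them.
    by_area = sorted(
        boxes,
        key=lambda b: (b[2] - b[0]) * (b[3] - b[1]),
        reverse=True,
    )
    for x0, y0, x1, y1, cls in by_area:
        # Clip the box to the image bounds.
        x0, y0 = max(x0, 0), max(y0, 0)
        x1, y1 = min(x1, width), min(y1, height)
        for y in range(y0, y1):
            for x in range(x0, x1):
                label_map[y][x] = cls
    return label_map


def labels_for_points(
    label_map: List[List[int]], pixels: List[Tuple[int, int]]
) -> List[int]:
    """Look up a pseudo label for each projected point (u = column, v = row).

    Points that project outside the image get IGNORE; these correspond to
    the unpaired points that have no 2D correspondence.
    """
    height = len(label_map)
    width = len(label_map[0]) if height else 0
    labels = []
    for u, v in pixels:
        if 0 <= v < height and 0 <= u < width:
            labels.append(label_map[v][u])
        else:
            labels.append(IGNORE)
    return labels


if __name__ == "__main__":
    demo_boxes = [(0, 0, 4, 4, 1), (1, 1, 3, 3, 2)]
    demo_map = boxes_to_label_map(demo_boxes, height=4, width=5)
    for row in demo_map:
        print(row)
    print(labels_for_points(demo_map, [(1, 1), (4, 0), (10, 10)]))
```

In this sketch the overlap rule is the main design choice: a plain box fill cannot recover object silhouettes, so a real pipeline would likely need a more careful refinement step to reach the "high-quality" label maps the abstract describes.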