What question did this study set out to answer?

The aim is to develop a pre-training pipeline that learns 3D features directly from category space to improve representation accuracy.

April 3, 2026

Learning 3D Representation from Auto-labeled 2D Object Boxes

Key Points

Utilized unlabeled LiDAR-camera pairs for training
Employed a fixed 2D open-vocabulary object detector to generate auto-labeled 2D boxes
Created pixel-wise label maps from 2D boxes using a box-to-label-maps generation algorithm
Introduced a dual-space pre-training 3D network for enhanced category recognition and object segmentation
Introduced AdaptPro, a module that uses category prototypes to exploit 3D features lacking 2D correspondences when fine-tuning under limited annotations
Achieved state-of-the-art performance on nuScenes and SemanticKITTI datasets
Demonstrated improved accuracy in recognizing 3D features compared to existing methods
Effectively segmented complete objects using the proposed approach
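
To make the box-to-label-map step above more concrete, here is a minimal sketch of the general idea, not the paper's box-to-label-maps generation algorithm, whose details are not given here. It assumes each detection arrives as a hypothetical `(x1, y1, x2, y2, class_id, score)` tuple, simply paints box interiors into an image-sized label map (higher-scoring boxes win overlaps), and then reads pseudo labels for LiDAR points that are assumed to be already projected to in-bounds pixel coordinates. The names `IGNORE`, `boxes_to_label_map`, and `labels_for_points` are illustrative only.

```python
import numpy as np

IGNORE = -1  # hypothetical label for pixels not covered by any box


def boxes_to_label_map(boxes, height, width):
    """Rasterize (x1, y1, x2, y2, class_id, score) boxes into an HxW label map.

    Naive baseline: boxes are painted in ascending score order, so where
    boxes overlap the highest-scoring box ends up on top.
    """
    label_map = np.full((height, width), IGNORE, dtype=np.int64)
    for x1, y1, x2, y2, class_id, _score in sorted(boxes, key=lambda b: b[5]):
        # Clip the box to the image bounds before painting.
        x1c, x2c = max(0, int(x1)), min(width, int(x2))
        y1c, y2c = max(0, int(y1)), min(height, int(y2))
        if x1c < x2c and y1c < y2c:
            label_map[y1c:y2c, x1c:x2c] = class_id
    return label_map


def labels_for_points(label_map, pixels_uv):
    """Look up pseudo labels for points given as in-bounds (u, v) pixels.

    u is the column index and v is the row index. The LiDAR-to-camera
    projection itself is assumed to have been done elsewhere.
    """
    uv = np.asarray(pixels_uv, dtype=np.int64)
    return label_map[uv[:, 1], uv[:, 0]]
```

A rectangular fill like this labels background pixels inside each box as the object, which is presumably the kind of noise a dedicated generation algorithm is meant to reduce.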

Abstract

Recent advances in LiDAR representation learning with limited annotations show strong promise. Existing well-performing methods mainly focus on distilling the 2D representation into the 3D representation via superpixels. Because superpixels are used to construct the cross-modal contrastive learning, 3D features belonging to the same object become semantically ambiguous, which impairs performance. To this end, we aim to leverage unlabeled LiDAR-camera pairs to design a novel pre-training pipeline that learns from category space directly and pulls the 3D features belonging to the same object close together. Specifically, we obtain auto-labeled 2D object boxes with a fixed 2D open-vocabulary object detector and transform the labeled 2D object boxes into high-quality pixel-wise label maps with a box-to-label-maps generation algorithm. Based on the pseudo labels, we present a dual-space pre-training 3D network that recognizes accurate categories from the semantic priors of paired 3D points and segments complete objects. Furthermore, we propose a module named AdaptPro to further improve performance when fine-tuning the 3D network under limited annotations; it uses category prototypes to exploit the unpaired 3D features that lack 2D correspondences. The experimental results show that our method achieves state-of-the-art performance on both the nuScenes and SemanticKITTI benchmark datasets. Code is available at https://github.com/dengq7/Box4Scene.
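
The category-prototype idea mentioned for AdaptPro can be illustrated with a generic sketch. This is an assumption-laden toy, not the AdaptPro module itself: it averages L2-normalized features of labeled (paired) points into one prototype per class, then assigns each unpaired point to the prototype with the highest cosine similarity. The function names and the choice of cosine similarity are hypothetical.

```python
import numpy as np


def _l2_normalize(x):
    """Scale each row to unit length, guarding against zero-length rows."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)


def class_prototypes(features, labels, num_classes):
    """Mean of L2-normalized features per class.

    Classes with no labeled points keep an all-zero prototype.
    """
    feats = _l2_normalize(np.asarray(features, dtype=np.float64))
    labels = np.asarray(labels)
    protos = np.zeros((num_classes, feats.shape[1]), dtype=np.float64)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = feats[mask].mean(axis=0)
    return protos


def assign_to_prototypes(features, protos):
    """Assign each feature to the most similar non-empty prototype."""
    feats = _l2_normalize(np.asarray(features, dtype=np.float64))
    sims = feats @ _l2_normalize(protos).T
    # Empty (all-zero) prototypes must never win the argmax.
    empty = np.linalg.norm(protos, axis=1) == 0
    sims[:, empty] = -np.inf
    return sims.argmax(axis=1)
```

In this toy setup, prototypes would be built from points that received pseudo labels through their 2D correspondences and then used to label points that fall outside the camera view.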
