Traditional 3D scene understanding methods heavily depend on 3D annotation and training, which allow for the identification of seen classes but struggle to recognize unseen classes. In this paper, we leverage the open vocabulary inference capabilities of pre-trained models, enabling the encoding of open vocabulary concepts. However, unlike existing open vocabulary 3D scene understanding methods, we propose a framework based on semantic probability. This innovation significantly reduces computational cost and is compatible with state-of-the-art two-stage 2D pre-trained models. Specifically, we align the text features from the CLIP model with the pixel features from the 2D pre-trained models, inferring semantic probability of image pixels based on similarity and projecting it onto 3D points. Subsequently, we introduce a point cloud pairs semantic fusion method to merge the point clouds, reducing the semantic probability of erroneous 3D points. Based on probability scores, we achieve 3D semantic segmentation on open vocabularies without any supervision or training. In addition, the semantic probability of 3D points can serve as pseudo-labels for 3D distillation, and the geometric features of the 3D scene can be exploited to improve the segmentation performance. Experimental results demonstrate that the proposed method exhibits competitive performance on publicly available benchmark datasets, including ScanNet, Matterport3D, and nuScenes.
Shen et al. (Wed,) studied this question.