Open-vocabulary 3D scene segmentation is a fundamental capability in computer vision, as it enables systems to recognize and segment arbitrary objects in complex environments. However, existing approaches often struggle to generalize to unseen categories and do not jointly exploit geometric structure and semantic information. In this paper, we introduce OV3DSeg-VGGT, a novel framework that builds a 3D scene segmentation model by combining geometric priors distilled from a pretrained visual geometry transformer with semantic knowledge. Our method leverages temporally consistent 2D segmentation and cross-modal embeddings from CLIP to construct robust cross-view instance representations. By fine-tuning the visual geometry transformer with a contrastive learning objective and introducing a CLIP-guided distillation projector, we align geometric features with semantic priors, enabling segmentation that generalizes strongly to novel categories. Extensive experiments show that OV3DSeg-VGGT outperforms existing state-of-the-art baselines in open-vocabulary 3D segmentation and generalizes to novel categories.
Zhou et al. (Sun,) studied this question.
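To make the two training signals named in the abstract concrete, the following is a minimal sketch of an instance-level contrastive loss and a CLIP-guided distillation loss. It is an illustration under stated assumptions, not the paper's implementation: the InfoNCE-style form, the cosine distillation term, the linear projector, the temperature value, and all function names are hypothetical choices, and feature vectors are assumed to be nonzero.

```python
import math

# Hypothetical sketch: none of these names or loss forms come from the paper.

def normalize(v):
    """Scale a nonzero vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def instance_contrastive_loss(features, instance_ids, temperature=0.1):
    """InfoNCE-style loss: features with the same instance id are positives.

    instance_ids would come from the temporally consistent 2D segmentation;
    anchors with no positive are skipped.
    """
    feats = [normalize(f) for f in features]
    losses = []
    for i, fi in enumerate(feats):
        others = [j for j in range(len(feats)) if j != i]
        positives = [j for j in others if instance_ids[j] == instance_ids[i]]
        if not positives:
            continue
        logits = {j: dot(fi, feats[j]) / temperature for j in others}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        losses.append(-sum(logits[j] - log_denom for j in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0

def project(feature, weight):
    """Linear projector (assumed): maps a geometric feature into CLIP space."""
    return [dot(row, feature) for row in weight]

def distillation_loss(features, weight, clip_targets):
    """Mean (1 - cosine similarity) between projected features and CLIP embeddings."""
    total = 0.0
    for f, t in zip(features, clip_targets):
        total += 1.0 - dot(normalize(project(f, weight)), normalize(t))
    return total / len(features)
```

In this sketch the contrastive term pulls together features of the same instance across views, while the distillation term aligns projected geometric features with CLIP embeddings. How the two terms are weighted against each other is not specified in the abstract and is left open here.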