What question did this study set out to answer?

The research aims to enhance 3D scene segmentation by integrating geometric priors and semantic knowledge to improve generalization to new categories.

March 30, 2026

Open Access

OV3DSeg-VGGT: Open-Vocabulary 3D Segmentation with Visual Geometry-Grounded Transformers

Key Points

Developed OV3DSeg-VGGT, a framework that combines geometric priors distilled from a pretrained visual transformer with semantic knowledge.
Used temporally consistent 2D segmentation and cross-modal CLIP embeddings to build robust cross-view instance representations.
Fine-tuned the visual geometry transformer with a contrastive learning objective and added a CLIP-guided distillation projector to align geometric features with semantic priors.
In the reported experiments, OV3DSeg-VGGT outperformed existing state-of-the-art baselines on open-vocabulary 3D segmentation.
Demonstrated strong generalization to novel categories not seen during training.

Abstract

Open-vocabulary 3D scene segmentation brings a fundamental capability of human perception to computer vision, as it enables systems to recognize and segment arbitrary objects in complex environments. However, existing approaches often struggle to generalize to unseen categories and cannot jointly exploit geometric structure and semantic information. In this paper, we introduce OV3DSeg-VGGT, a novel framework that builds a 3D scene segmentation model by combining geometric priors distilled from a pretrained visual transformer with semantic knowledge. Our method leverages temporally consistent 2D segmentation and cross-modal embeddings from CLIP to construct robust cross-view instance representations. By fine-tuning the visual geometry transformer with a contrastive learning objective and introducing a CLIP-guided distillation projector, we align geometric features with semantic priors, enabling segmentation with strong generalization to novel categories. Extensive experiments show that OV3DSeg-VGGT outperforms existing state-of-the-art baselines and generalizes to novel categories in open-vocabulary 3D segmentation.
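The abstract does not spell out the loss or the inference step, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a symmetric InfoNCE loss as a stand-in for the "contrastive learning objective" and cosine-similarity matching against text embeddings as a stand-in for open-vocabulary labeling. The function names (`info_nce`, `assign_labels`), the temperature value, and the array shapes are all hypothetical, and random arrays replace real geometric features and CLIP embeddings.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def info_nce(geo_feats, sem_feats, temperature=0.07):
    """Symmetric InfoNCE loss (an assumed stand-in for the paper's objective).

    geo_feats: (N, D) projected geometric features (hypothetical).
    sem_feats: (N, D) semantic embeddings, e.g. from CLIP (hypothetical).
    Row i of each array is treated as a positive pair; all other rows are negatives.
    """
    g = l2_normalize(geo_feats)
    s = l2_normalize(sem_feats)
    logits = g @ s.T / temperature  # (N, N) scaled cosine similarities

    def diagonal_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the geometry-to-semantic and semantic-to-geometry directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

def assign_labels(point_feats, text_embeds):
    """Give each point the index of the most similar text embedding."""
    sims = l2_normalize(point_feats) @ l2_normalize(text_embeds).T
    return sims.argmax(axis=1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sem = rng.normal(size=(8, 16))
    geo = sem + 0.01 * rng.normal(size=(8, 16))  # nearly aligned features
    print("loss (aligned):", info_nce(geo, sem))
    print("loss (mismatched):", info_nce(sem[::-1], sem))
```

Because labels are chosen by similarity to text embeddings rather than from a fixed classifier head, a new category can in principle be queried by adding one more row to `text_embeds`, which is the sense in which such a pipeline is "open-vocabulary".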


Cite This Study

Zhou et al. (Sun,) studied this question.

synapsesocial.com/papers/69ca12d4883daed6ee0951d8 https://doi.org/10.1016/j.visinf.2026.100311
