What question did this study set out to answer?

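This study asks how to make zero-shot semantic segmentation of airborne point clouds more accurate, given that labels distilled from 2D vision–language models tend to let large background classes (e.g., ground) swallow small targets (e.g., vehicles).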

June 9, 2026 · Open Access

Consistency-Guided Distillation from Vision Foundation Models for Zero-Shot Airborne Point Cloud Segmentation

Key Points

Developed a geometry-constrained pseudo-label generation and purification framework.
Implemented a dual-branch design utilizing SAM3 for open-vocabulary semantics and SAM2 for class-agnostic instances.
Used a Masked Cross-Entropy Loss to supervise a 3D sparse convolutional network with purified pseudo-labels.
Achieved an mIoU improvement from 52.15% to 63.45% on the H3D dataset.
Increased mIoU from 29.52% to 58.51% on the Turin3D dataset.
Demonstrated enhanced recovery of small-scale targets frequently submerged in the background.
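
The Masked Cross-Entropy Loss named in the key points can be illustrated with a minimal sketch. It assumes that points whose pseudo-labels were filtered out carry a sentinel label and are simply skipped when averaging; the sentinel value `IGNORE_LABEL = -1` and the function name are hypothetical, and the paper's actual implementation may differ.

```python
import math

IGNORE_LABEL = -1  # hypothetical sentinel for points with no trusted pseudo-label


def masked_cross_entropy(logits, labels, ignore_label=IGNORE_LABEL):
    """Mean cross-entropy over points that still carry a pseudo-label.

    logits: list of per-point class score lists (one list per point).
    labels: list of integer class ids, or ignore_label for masked points.
    """
    total, count = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_label:
            continue  # masked point: contributes nothing to the loss
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]  # equals -log softmax(scores)[label]
        count += 1
    return total / count if count else 0.0


# Two labelled points with uniform scores, one masked point.
loss = masked_cross_entropy([[0.0, 0.0], [0.0, 0.0], [50.0, -50.0]], [0, 1, -1])
```

Because masked points are skipped rather than assigned a default class, a network trained this way is never penalized against a pseudo-label that the purification step rejected.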

Abstract

Semantic segmentation of large-scale airborne point clouds traditionally relies on labor-intensive 3D manual annotations. While recent zero-shot methods attempt to alleviate this burden by distilling knowledge from 2D Vision–Language Models (VLMs) via 2D-to-3D projection, they suffer from performance degradation in complex urban environments. Specifically, lacking 3D geometric awareness, 2D VLMs frequently exhibit “semantic bleeding”, where large-scale background categories (e.g., ground) erroneously submerge small-scale targets (e.g., vehicles and street elements). To address this issue, we propose a geometry-constrained pseudo-label generation and purification framework. Our approach tackles the problem through a dual-branch design: extracting open-vocabulary semantics via SAM3-based multi-view projection while simultaneously deriving sharp, class-agnostic instances using SAM2 on Gamma-transformed elevation maps. By introducing a geometric–semantic consistency module, we evaluate the internal semantic purity and external spatial homogeneity of these instances, detecting and filtering out semantic misclassifications. The purified pseudo-labels are then used to supervise a 3D sparse convolutional network via a Masked Cross-Entropy Loss. Experiments on the H3D and Turin3D datasets demonstrate that our method recovers small-scale targets that are prone to being submerged, outperforming existing zero-shot baselines by improving mIoU from 52.15% to 63.45% on H3D and from 29.52% to 58.51% on Turin3D, thereby narrowing the performance gap with fully supervised approaches.
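
To make the "Gamma-transformed elevation maps" concrete, here is a minimal sketch of one plausible reading: rasterized heights are normalized to [0, 1], passed through a power curve, and scaled to 8-bit grey values before being handed to a 2D segmenter. The function name, the `gamma=0.5` default, and the use of `None` for empty cells are assumptions for illustration only; the abstract does not state the gamma value or the rasterization details.

```python
def gamma_elevation_map(heights, gamma=0.5):
    """Convert a 2D grid of heights into 8-bit grey values with a gamma curve.

    heights: list of rows; each cell is a height in metres or None (empty cell).
    gamma:   exponent applied to the normalized height (assumed value).
    """
    valid = [h for row in heights for h in row if h is not None]
    if not valid:
        return [[0 for _ in row] for row in heights]
    lowest, highest = min(valid), max(valid)
    span = highest - lowest
    image = []
    for row in heights:
        out_row = []
        for h in row:
            if h is None or span == 0:
                out_row.append(0)  # empty cell or flat scene: render as black
            else:
                norm = (h - lowest) / span  # normalized height in [0, 1]
                out_row.append(int(255 * norm ** gamma + 0.5))  # round to nearest
        image.append(out_row)
    return image


grid = [[0.0, 1.0], [4.0, None]]
curved = gamma_elevation_map(grid, gamma=0.5)
linear = gamma_elevation_map(grid, gamma=1.0)
```

With an exponent below 1, the 1 m cell maps to a brighter grey value than under linear scaling, so low objects stand out more clearly from the ground, which is presumably why a gamma curve helps a 2D model delineate small instances.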
