April 12, 2026 | Open Access

Distilling Vision Foundation Models into LiDAR Networks via Manifold-Aware Topological Alignment

Key Points

The central aim is to improve LiDAR point cloud semantic segmentation by transferring knowledge from vision foundation models.
Developed the Cross-Modal Collaborative Manifold Distillation (CMCMD) framework.
Used an Adaptive Relation Convolution (ARConv) backbone with geometry-conditioned aggregation for feature extraction.
Implemented a Unified Bidirectional Mapping Module (UBMM) for explicit 2D–3D interaction.
Employed Manifold-Aware Topological Distillation (MATD) to align inter-sample affinity structures rather than enforce pointwise feature matching.
Achieved mIoU values of 72.9% on SemanticKITTI and 81.2% on nuScenes.
Outperformed the compared distillation baselines.
Approached the performance of multimodal fusion methods at lower inference cost.
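
For readers unfamiliar with the mIoU metric reported above, the following is a minimal sketch of how mean Intersection-over-Union is commonly computed from a confusion matrix. The function name `mean_iou`, the toy matrix, and the choice to skip classes that never occur are illustrative assumptions; the benchmarks' official evaluation scripts may handle ignored labels differently.

```python
def mean_iou(confusion):
    """Mean IoU from a square confusion matrix.

    confusion[i][j] = number of points whose true class is i
    and whose predicted class is j (an assumed layout).
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                       # true c, predicted otherwise
        fp = sum(confusion[r][c] for r in range(n)) - tp  # predicted c, true otherwise
        denom = tp + fp + fn
        if denom > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical two-class example, not data from the study.
toy_confusion = [
    [8, 2],
    [1, 9],
]
toy_miou = mean_iou(toy_confusion)
```

In this toy case the per-class IoUs are 8/11 and 9/12, and the metric is their unweighted average, so rare classes count as much as frequent ones.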

Abstract

LiDAR point cloud semantic segmentation is essential for autonomous driving, yet LiDAR-only methods remain constrained by sparsity and limited texture cues. We propose Cross-Modal Collaborative Manifold Distillation (CMCMD), which transfers open-world semantic priors from the DINOv3 Vision Foundation Model to a LiDAR student network. The framework combines an Adaptive Relation Convolution (ARConv) backbone with geometry-conditioned aggregation, a Unified Bidirectional Mapping Module (UBMM) for explicit 2D–3D interaction, and Manifold-Aware Topological Distillation (MATD), which aligns inter-sample affinity structures in a shared latent manifold rather than enforcing pointwise feature matching. By preserving relational topology instead of absolute feature coordinates, CMCMD mitigates negative transfer across heterogeneous modalities. Experiments on SemanticKITTI and nuScenes yield mIoU values of 72.9% and 81.2%, respectively, surpassing the compared distillation baselines and approaching the performance of multimodal fusion methods at lower inference cost. Additional evaluation on real-world campus scenes further supports the cross-domain robustness of the proposed framework.
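
The contrast between aligning affinity structures and enforcing pointwise feature matching can be illustrated with a small sketch. This is not the MATD formulation from the study: the cosine-similarity affinity, the mean-squared-error comparison, and all function names below are assumptions chosen for illustration only.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def affinity_matrix(features):
    """Pairwise cosine similarity between all samples (one row per sample)."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def relational_loss(student, teacher):
    """Mean squared gap between student and teacher affinity matrices.

    Only sample-to-sample relations are compared, so the two feature
    spaces may have different dimensions or orientations.
    """
    a_s, a_t = affinity_matrix(student), affinity_matrix(teacher)
    n = len(a_s)
    total = sum((a_s[i][j] - a_t[i][j]) ** 2 for i in range(n) for j in range(n))
    return total / (n * n)


def pointwise_loss(student, teacher):
    """Mean squared gap between raw feature coordinates (needs equal dims)."""
    diffs = [(s - t) ** 2 for sv, tv in zip(student, teacher) for s, t in zip(sv, tv)]
    return sum(diffs) / len(diffs)


# Hypothetical features: the student is the teacher rotated by 90 degrees,
# so relations between samples are identical but coordinates are not.
teacher_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_feats = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

rel = relational_loss(student_feats, teacher_feats)
pw = pointwise_loss(student_feats, teacher_feats)
```

Here the relational loss is essentially zero while the pointwise loss is large, which is the intuition behind preserving relational topology instead of absolute feature coordinates when the teacher and student come from different modalities.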
