LiDAR point cloud semantic segmentation is essential for autonomous driving, yet LiDAR-only methods remain constrained by sparsity and limited texture cues. We propose Cross-Modal Collaborative Manifold Distillation (CMCMD), which transfers open-world semantic priors from the DINOv3 Vision Foundation Model to a LiDAR student network. The framework combines an Adaptive Relation Convolution (ARConv) backbone with geometry-conditioned aggregation, a Unified Bidirectional Mapping Module (UBMM) for explicit 2D–3D interaction, and Manifold-Aware Topological Distillation (MATD), which aligns inter-sample affinity structures in a shared latent manifold rather than enforcing pointwise feature matching. By preserving relational topology instead of absolute feature coordinates, CMCMD mitigates negative transfer across heterogeneous modalities. Experiments on SemanticKITTI and nuScenes yield mIoU values of 72.9% and 81.2%, respectively, surpassing the compared distillation baselines and approaching the performance of multimodal fusion methods at lower inference cost. Additional evaluation on real-world campus scenes further supports the cross-domain robustness of the proposed framework.
Yang et al. (Thu,) studied this question.
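MATD's central idea is to match relational structure instead of absolute feature coordinates. A generic affinity-distillation loss shows what that means in practice. This is a minimal sketch under stated assumptions, not the authors' MATD implementation:

- It assumes student and teacher features are already paired one-to-one, for example LiDAR points matched to DINOv3 pixel features by a 2D-to-3D projection step.
- It omits the shared latent-manifold projection.
- `cosine_affinity` and `affinity_distillation_loss` are hypothetical names.
- Mean squared error between cosine-similarity matrices is one common choice of discrepancy, assumed here.

```python
import numpy as np


def cosine_affinity(features: np.ndarray) -> np.ndarray:
    """Pairwise cosine-similarity matrix for an (N, D) feature array."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)  # guard against zero vectors
    return unit @ unit.T  # (N, N) relational structure of the batch


def affinity_distillation_loss(student: np.ndarray, teacher: np.ndarray) -> float:
    """Hypothetical relational loss: MSE between student and teacher affinities.

    Assumes row i of `student` and row i of `teacher` describe the same sample
    (e.g. a projected LiDAR point and its image pixel). Feature widths may differ,
    because only the (N, N) similarity matrices are compared.
    """
    if student.shape[0] != teacher.shape[0]:
        raise ValueError("student and teacher must contain the same number of paired samples")
    a_s = cosine_affinity(student)
    a_t = cosine_affinity(teacher)
    return float(np.mean((a_s - a_t) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 16))               # 8 paired samples, 16-D "teacher" features
    q, _ = np.linalg.qr(rng.normal(size=(16, 16)))   # random orthogonal matrix
    print(affinity_distillation_loss(teacher @ q, teacher))  # near zero: rotation keeps relations
    student = rng.normal(size=(8, 4))                # narrower, unrelated "student" features
    print(affinity_distillation_loss(student, teacher))      # positive: relations differ
```

The loss compares only pairwise similarities. As a result, it stays near zero when either feature space is rotated or rescaled, and it does not require the two networks to share a feature width. That is the property the abstract relies on when it argues against pointwise feature matching across heterogeneous modalities.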