What question did this study set out to answer?

The study aims to improve 3D visual localization by addressing three limitations of existing methods: over-reliance on a single modality, poor adaptability to viewpoint changes, and limited cross-modal feature fusion.

February 26, 2026 | Open Access

3D Visual Localization with Cross-Modal Interaction Learning and Iterative Fusion

Key points

Extracted point cloud and text features with a point cloud encoder and a text encoder, and introduced point cloud category information.
Implemented a Transformer-based module that enhances the expressiveness of point cloud features.
Used a symmetric interaction learning module to capture deep associations between point cloud and text features and to suppress interference from irrelevant features.
Developed a cross-modal iterative fusion module that progressively fuses cross-modal information.
Achieved 86.19% at Acc@0.25 and 69.68% at Acc@0.5 on the unique subset of ScanRefer.
Attained a localization accuracy of 65.20% on the easy subset of Nr3D.
Achieved a localization accuracy of 74.87% on the easy subset of Sr3D.
Demonstrated consistent localization performance across multiple 3D scenes.
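
For readers unfamiliar with the ScanRefer metrics quoted above: Acc@0.25 and Acc@0.5 are conventionally the fraction of predictions whose 3D intersection-over-union (IoU) with the ground-truth box reaches the given threshold. The sketch below illustrates that idea under stated assumptions. Boxes are axis-aligned and written as (xmin, ymin, zmin, xmax, ymax, zmax), the names `iou_3d` and `acc_at_iou` are hypothetical, and counting an IoU exactly equal to the threshold as a hit is an assumption. It is not the paper's evaluation code.

```python
from typing import Sequence, Tuple

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Tuple[float, float, float, float, float, float]


def box_volume(box: Box) -> float:
    """Volume of an axis-aligned box; degenerate boxes have volume 0."""
    volume = 1.0
    for axis in range(3):
        volume *= max(0.0, box[axis + 3] - box[axis])
    return volume


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    intersection = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])
        high = min(a[axis + 3], b[axis + 3])
        intersection *= max(0.0, high - low)  # 0 if no overlap on this axis
    union = box_volume(a) + box_volume(b) - intersection
    return intersection / union if union > 0 else 0.0


def acc_at_iou(preds: Sequence[Box], gts: Sequence[Box], threshold: float) -> float:
    """Fraction of predictions whose IoU with the paired ground truth is >= threshold."""
    if len(preds) != len(gts):
        raise ValueError("preds and gts must have the same length")
    if not preds:
        return 0.0
    hits = sum(iou_3d(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(preds)


if __name__ == "__main__":
    predictions = [(0, 0, 0, 2, 2, 2), (0, 0, 0, 2, 2, 2)]
    ground_truth = [(1, 0, 0, 3, 2, 2), (0, 0, 0, 2, 2, 2)]  # IoU = 1/3 and 1.0
    print(acc_at_iou(predictions, ground_truth, 0.25))  # 1.0
    print(acc_at_iou(predictions, ground_truth, 0.5))   # 0.5
```

The stricter threshold explains why the Acc@0.5 figure is lower than the Acc@0.25 figure: a prediction that overlaps the target only loosely counts at 0.25 but not at 0.5.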

Abstract

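To address the problems of existing 3D visual localization methods, namely over-reliance on single-modality information, poor adaptability to viewpoint changes, and limited cross-modal feature fusion, this paper proposes a 3D visual localization method based on cross-modal interaction learning and iterative fusion. The method consists of two stages: multimodal feature extraction and cross-modal feature fusion. In the feature extraction stage, a point cloud encoder and a text encoder extract point cloud and text features respectively, and point cloud category information is introduced. In the feature fusion stage, a Transformer-based point cloud feature enhancement module is designed to improve the expressiveness of point cloud features; a symmetric interaction learning module captures the deep associations between point cloud and text features and effectively suppresses interference from irrelevant features; and a cross-modal iterative fusion module progressively fuses cross-modal information, strengthening the model's localization ability in complex scenes. Experimental results show that the method achieves overall accuracy improvements on three classic 3D visual localization datasets: ScanRefer, Nr3D, and Sr3D. On the unique subset of ScanRefer, Acc@0.25 and Acc@0.5 reach 86.19% and 69.68%, respectively; on the easy subsets of Nr3D and Sr3D, localization accuracy reaches 65.20% and 74.87%, respectively. The method shows stable localization ability across a variety of 3D scenes, which verifies its excellent performance in enhancing multimodal interaction and cross-modal fusion.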
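
The abstract does not specify how the symmetric interaction learning and iterative fusion modules are built. One common way to realize this kind of design is bidirectional cross-attention with residual connections, repeated for a fixed number of steps. The sketch below is an illustration under that assumption only. It omits learned projections, multiple heads, and normalization, and the names `cross_attend`, `symmetric_interaction`, and `iterative_fusion` are hypothetical rather than taken from the paper.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]  # one feature vector per row


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: each query row becomes a
    softmax-weighted average of the context rows."""
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ci for qi, ci in zip(q, c)) / math.sqrt(dim) for c in context]
        weights = softmax(scores)
        attended.append([sum(w * c[j] for w, c in zip(weights, context)) for j in range(dim)])
    return attended


def add_rows(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise sum, used here as a residual connection."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def symmetric_interaction(points: Matrix, words: Matrix) -> Tuple[Matrix, Matrix]:
    """Both modalities attend to each other using the same inputs,
    so neither direction is privileged (assumed design)."""
    new_points = add_rows(points, cross_attend(points, words))
    new_words = add_rows(words, cross_attend(words, points))
    return new_points, new_words


def iterative_fusion(points: Matrix, words: Matrix, steps: int = 2) -> Tuple[Matrix, Matrix]:
    """Repeat the symmetric interaction so information is fused progressively."""
    for _ in range(steps):
        points, words = symmetric_interaction(points, words)
    return points, words


if __name__ == "__main__":
    point_features = [[1.0, 0.0], [0.0, 1.0]]  # two toy object features
    word_features = [[0.5, 0.5]]               # one toy word feature
    fused_points, fused_words = iterative_fusion(point_features, word_features, steps=1)
    print(fused_points)  # [[1.5, 0.5], [0.5, 1.5]]
    print(fused_words)   # [[1.0, 1.0]]
```

Computing both attention directions from the same inputs keeps the update symmetric. A sequential variant, where the text is updated from already-updated point features, would be an equally plausible reading of the abstract.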
