What question did this study set out to answer?

The aim is to create a more accurate method for identifying objects in 3D environments based on language commands.

March 14, 2026 | Open Access

Explicit geometric relationships under limited spatial reference points guide 3D visual grounding

Key Points

Introduced the 3DRelKG framework to improve spatial relationship modeling.
Dynamic sampling of limited spatial reference points in scenes.
Developed a heterogeneous feature fusion module using a similarity matrix.
Achieved accuracy gains of 2.3%, 4.3%, and 2.1% on ScanRefer, SR3D, and NR3D, respectively.
Enhanced inference speed by 43.8% compared to the previous best model.
Outperformed state-of-the-art models on all three datasets.

Abstract

• Proposed a new 3D visual grounding method using relative geometric relationships.
• Improved localization accuracy by up to 4.3% and inference speed by 43.8% compared to the previous best model.
• Designed a novel feature fusion strategy to better align language and 3D data.
• Enabled machines to more accurately locate objects in 3D scenes based on human language commands.

Three-Dimensional Visual Grounding (3DVG) aims to locate objects in 3D scenes based on natural language queries. Existing methods typically rely on absolute position encoding of global objects to model spatial relationships for target localization. However, this often results in inadequate spatial understanding and redundant or invalid encoding. To address these limitations, we propose the Relative K-object Fusion Perception 3D Visual Grounding (3DRelKG) framework. By dynamically sampling a limited number of scene reference points, our approach enhances the modeling of spatial relationships for localization targets by learning only the positional and relative spherical geometric features of these reference points with respect to scene objects. Additionally, we introduce a heterogeneous feature fusion module, whose core is an information interaction mechanism based on the similarity matrix of heterogeneous features. This design avoids the unreliable attention interaction weights that arise from directly calculating the similarity between heterogeneous features. Experiments on ScanRefer, SR3D, and NR3D demonstrate that our method outperforms state-of-the-art models, improving accuracy by 2.3%, 4.3%, and 2.1%, respectively, and increasing inference speed by 43.8%.
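To make the phrase "relative spherical geometric features" more concrete, the sketch below computes, for each scene object, its spherical coordinates (distance, polar angle, azimuth) relative to each of a small set of reference points. This is only an illustration of the general idea, not the 3DRelKG implementation: the abstract does not state how reference points are sampled, which angle convention is used, or how the features are encoded, so the function name `relative_spherical_features`, the choice of object centers as inputs, and the convention (polar angle measured from the +z axis, azimuth in the xy-plane) are all assumptions.

```python
import math


def relative_spherical_features(object_centers, reference_points):
    """Hypothetical helper: spherical offsets of objects w.r.t. reference points.

    object_centers:   list of (x, y, z) tuples, one per scene object.
    reference_points: list of (x, y, z) tuples, the K sampled reference points.
    Returns a nested list feats[i][k] = (r, theta, phi) for object i and
    reference point k, where r is the distance, theta the polar angle from
    the +z axis, and phi the azimuth in the xy-plane (assumed convention).
    """
    features = []
    for ox, oy, oz in object_centers:
        row = []
        for rx, ry, rz in reference_points:
            dx, dy, dz = ox - rx, oy - ry, oz - rz
            r = math.sqrt(dx * dx + dy * dy + dz * dz)
            # Guard against a zero-length offset (object sits on the reference point).
            theta = math.acos(max(-1.0, min(1.0, dz / r))) if r > 0 else 0.0
            phi = math.atan2(dy, dx)
            row.append((r, theta, phi))
        features.append(row)
    return features


# Two objects described relative to a single reference point at the origin.
feats = relative_spherical_features(
    [(1.0, 0.0, 0.0), (0.0, 0.0, 2.0)],
    [(0.0, 0.0, 0.0)],
)
```

With K reference points and N objects, this yields N × K feature triples instead of the N × N pairwise encoding that a global scheme over all object pairs would need, which is one plausible reading of why limiting the reference points reduces redundant encoding.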
