• Proposed a new 3D visual grounding method using relative geometric relationships.
• Improved localization accuracy by 4.3% and inference speed by 43.8% compared to the previous best model.
• Designed a novel feature fusion strategy to better align language and 3D data.
• Enabled machines to more accurately locate objects in 3D scenes based on human language commands.

Three-Dimensional Visual Grounding (3DVG) aims to locate objects in 3D scenes based on natural language queries. Existing methods typically rely on absolute position encoding of global objects to model spatial relationships for target localization. However, this often results in inadequate spatial understanding and redundant or invalid encoding. To address these limitations, we propose the Relative K-object Fusion Perception 3D Visual Grounding (3DRelKG) framework. Our approach dynamically samples a limited number of scene reference points and learns only their positional features and their relative spherical geometric features with respect to scene objects, which strengthens the modeling of spatial relationships for the localization target. Additionally, we introduce a heterogeneous feature fusion module, whose core is an information interaction mechanism based on the similarity matrix of heterogeneous features. This design naturally avoids the unreliable attention interaction weights that arise when similarity is calculated directly between heterogeneous features. Experiments on ScanRefer, SR3D, and NR3D demonstrate that our method outperforms state-of-the-art models, improving accuracy by 2.3%, 4.3%, and 2.1%, respectively, and increasing inference speed by 43.8%.
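To make the idea of relative spherical geometric features concrete, the following is a minimal sketch and not the 3DRelKG implementation. It assumes that each scene object is summarized by a 3D center, that reference points are drawn from those centers, and that the relative feature is the tuple (radial distance, polar angle, azimuth). The function names `sample_reference_points` and `relative_spherical_features` are hypothetical, and uniform random sampling stands in for the dynamic sampling strategy, which the abstract does not specify.

```python
import math
import random


def sample_reference_points(object_centers, k, seed=0):
    """Pick at most k reference points from the object centers.

    Assumption: uniform random sampling is a placeholder for the
    dynamic sampling described in the abstract.
    """
    rng = random.Random(seed)
    k = min(k, len(object_centers))
    return rng.sample(object_centers, k)


def relative_spherical_features(ref_point, object_centers):
    """Return (r, theta, phi) of every object center relative to ref_point.

    r     : Euclidean distance to the reference point
    theta : polar angle measured from the +z axis, in [0, pi]
    phi   : azimuth in the x-y plane, in (-pi, pi]
    """
    rx, ry, rz = ref_point
    features = []
    for x, y, z in object_centers:
        dx, dy, dz = x - rx, y - ry, z - rz
        r = math.sqrt(dx * dx + dy * dy + dz * dz)
        if r == 0.0:
            # The object coincides with the reference point: angles are undefined,
            # so this sketch sets them to zero by convention.
            features.append((0.0, 0.0, 0.0))
            continue
        # Clamp to guard against rounding slightly outside [-1, 1].
        cos_theta = max(-1.0, min(1.0, dz / r))
        features.append((r, math.acos(cos_theta), math.atan2(dy, dx)))
    return features


if __name__ == "__main__":
    centers = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
    for ref in sample_reference_points(centers, k=2):
        print(ref, relative_spherical_features(ref, centers))
```

With k reference points and N objects, this produces k × N relative feature tuples instead of encoding every pairwise relation among all N objects, which is one plausible reading of how restricting attention to a few reference points reduces redundant encoding.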
Wang et al. studied this question.