What does this research mean for the field?

Compared with the EDA baseline, SESQ improves 3D visual grounding by 2.68% in overall Acc@0.5 on ScanRefer and by 4.9% on the challenging Nr3D 'Hard' subset.

What question did this study set out to answer?

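The study aims to improve 3D visual grounding, the task of localizing objects in 3D point clouds from natural language descriptions, by encoding relative spatial positions and by initializing object queries from text-to-scene similarity.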

March 3, 2026 · Open Access

SESQ: Spatially Aware Encoding and Semantically Guided Querying for 3D Grounding

Key Points

The study aims to improve 3D visual grounding by enhancing object localization using spatial and semantic techniques.
Developed SESQ framework combining Spatially Aware Encoding and Semantically Guided Querying.
Introduced Rotary Spatially Aware Encoder (RSAE) with Rotary Position Embeddings for better spatial representation.
Implemented Semantic Query Initialization (SQI) to align textual and visual features for object queries.
Achieved a 2.68% improvement in overall Acc@0.5 on the ScanRefer benchmark over the EDA baseline.
Obtained a 4.9% improvement on the challenging Nr3D 'Hard' subset.
Increased overall Acc@0.25 by 1.1% on the Sr3D benchmark.
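
The rotary encoding named in the key points can be illustrated with a small sketch. It assumes one common way of extending Rotary Position Embeddings to 3D: the feature vector is split into three groups, one per coordinate axis, and consecutive feature pairs in each group are rotated by an angle proportional to that coordinate. The function name `rope_3d`, the `base` value, and the axis-wise split are illustrative assumptions, not details confirmed by the paper. The sketch shows the property that motivates the design: the attention score between two rotated vectors depends only on the offset between their positions.

```python
import math

def rope_3d(vec, xyz, base=100.0):
    """Rotate consecutive feature pairs by angles derived from a 3D coordinate.

    Hypothetical helper: vec is a list of floats whose length is divisible
    by 6 (three axes, two features per rotation pair); xyz is the point's
    (x, y, z) coordinate. The axis-wise split is an assumption.
    """
    d = len(vec)
    assert d % 6 == 0, "feature size must be divisible by 6"
    per_axis = d // 3        # features assigned to each axis
    pairs = per_axis // 2    # rotation pairs per axis
    out = []
    for axis, coord in enumerate(xyz):
        for p in range(pairs):
            freq = base ** (-p / pairs)   # geometric frequency schedule
            angle = coord * freq
            i = axis * per_axis + 2 * p
            a, b = vec[i], vec[i + 1]
            c, s = math.cos(angle), math.sin(angle)
            out.extend([a * c - b * s, a * s + b * c])
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

q = [0.3, -1.2, 0.7, 0.5, -0.4, 1.1, 0.9, -0.6, 0.2, 0.8, -1.0, 0.1]
k = [1.0, 0.2, -0.5, 0.4, 0.6, -0.3, -0.7, 0.9, 0.3, -0.2, 0.5, 1.2]

# Two query/key placements with the same offset (0.5, -1.0, 2.0)
# but different absolute positions in the scene.
score_a = dot(rope_3d(q, (1.0, 2.0, 3.0)), rope_3d(k, (0.5, 3.0, 1.0)))
score_b = dot(rope_3d(q, (4.0, -1.0, 0.0)), rope_3d(k, (3.5, 0.0, -2.0)))

# The scores agree up to floating-point error.
print(abs(score_a - score_b) < 1e-9)
```

Because each rotation preserves vector length, the encoding changes only how features align with one another, which is why translating the whole scene leaves the scores unchanged in this sketch.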

Abstract

3D visual grounding is a fundamental task for human–machine interaction, aiming to localize specific objects in complex 3D point clouds based on natural language descriptions. Despite recent advancements, existing Transformer-based architectures often rely on absolute position embeddings and heuristic query initialization, which lack the capacity to capture fine-grained relative spatial dependencies and fail to effectively filter out scene clutter. In this paper, we propose SESQ, a novel framework that synergizes Spatially Aware Encoding and Semantically Guided Querying for 3D grounding. Our approach introduces two key innovations. First, we propose the Rotary Spatially Aware Encoder (RSAE), which incorporates Rotary Position Embeddings (RoPE) into the self-attention layers. By transforming 3D coordinates into a rotary representation, RSAE enables the model to inherently capture relative spatial distances and maintains geometric consistency throughout the encoding stage. Second, a Semantic Query Initialization (SQI) module is designed to initialize object queries by explicitly computing the cross-modal similarity between textual embeddings and visual point cloud features. By replacing traditional heuristic-based selection with semantic-aware alignment, SQI ensures that the decoding process originates from contextually relevant object candidates, significantly reducing the impact of task-irrelevant distractors. Extensive experiments on ScanRefer and ReferIt3D (Nr3D/Sr3D) benchmarks demonstrate the effectiveness of our framework. Compared to the baseline EDA, our method achieves a significant performance gain of 2.68% in overall Acc@0.5 on ScanRefer, a 4.9% improvement on the challenging Nr3D “Hard” subset, and a 1.1% increase in overall Acc@0.25 on Sr3D.
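
The query initialization described in the abstract can be sketched as a similarity ranking. The example below assumes a single pooled text embedding, cosine similarity as the cross-modal score, and a top-k cut to pick query seeds. The names `cosine` and `select_queries`, the toy feature values, and the choice of top-k are hypothetical illustrations; the paper's SQI module may compute and use the similarity differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return sum(a * b for a, b in zip(u, v)) / (nu * nv + 1e-12)

def select_queries(text_feat, point_feats, k):
    """Return the indices of the k point features most similar to the text.

    Hypothetical helper: ranks every visual feature by its cosine
    similarity to the text feature and keeps the k best as query seeds.
    """
    scores = [cosine(text_feat, p) for p in point_feats]
    ranked = sorted(range(len(point_feats)), key=lambda i: scores[i], reverse=True)
    return ranked[:k], scores

# Toy features: in practice these would come from text and point cloud encoders.
text_feat = [1.0, 0.0, 1.0]
point_feats = [
    [0.9, 0.1, 1.1],    # points in nearly the same direction as the text
    [-1.0, 0.5, 0.0],   # scene clutter, negatively correlated
    [0.0, 1.0, 0.0],    # orthogonal to the text
    [2.0, 0.0, 1.5],    # also well aligned with the text
]

top_idx, scores = select_queries(text_feat, point_feats, k=2)
print(top_idx)
```

Ranking by similarity rather than by a text-agnostic heuristic means clutter features with low scores never become queries, which matches the abstract's stated goal of reducing task-irrelevant distractors.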


Cite This Study

Li et al. studied this question.

synapsesocial.com/papers/69a67ed1f353c071a6f0a488 https://doi.org/10.3390/computers15030145
