For a robot to operate effectively in human-centric environments, finding objects based on natural language is essential. Zero-shot object goal navigation is a significant challenge where robots must find unseen objects in new environments without prior knowledge. Existing methods often struggle with strategic exploration, leading to inefficient searches. In this study, we propose a hierarchical scene graph-based navigation system to address this challenge. Our core innovations are twofold: dynamically constructing a three-layer “room–workspace–object” hierarchical scene graph without manually pre-tuned parameters, and introducing a novel workspace-based searching strategy. By evaluating semantic relevance at the workspace level rather than the object level, the robot infers probable containers for a target, enabling focused, human-like exploration. Simulation results demonstrate that our system significantly outperforms existing state-of-the-art methods. Quantitatively, our approach improves the Success Rate (SR) by 26.8% (SR 0.4859) under distance-constrained settings and by 20.2% (SR 0.7360) under unconstrained settings, compared to the best baselines. These results validate that our framework offers a robust solution for zero-shot object goal navigation.
Kwon et al. studied this question.
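As an illustration of the three-layer "room–workspace–object" graph and the workspace-level ranking described in the abstract, the following is a minimal sketch, not the authors' implementation. The class names, the `add_observation` and `rank_workspaces` helpers, and the `TOY_PRIOR` table are all hypothetical. In particular, the relevance function here is a hand-written lookup table standing in for whatever semantic relevance model the system actually uses, which the abstract does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Workspace:
    """Middle layer: a surface or container (e.g. a desk) that holds objects."""
    name: str
    objects: List[str] = field(default_factory=list)


@dataclass
class Room:
    """Top layer: a room that groups workspaces."""
    name: str
    workspaces: List[Workspace] = field(default_factory=list)


@dataclass
class SceneGraph:
    """Three-layer room -> workspace -> object hierarchy, grown incrementally."""
    rooms: List[Room] = field(default_factory=list)

    def add_observation(self, room: str, workspace: str, obj: str) -> None:
        # Create room and workspace nodes on demand as new observations arrive.
        r = next((x for x in self.rooms if x.name == room), None)
        if r is None:
            r = Room(room)
            self.rooms.append(r)
        w = next((x for x in r.workspaces if x.name == workspace), None)
        if w is None:
            w = Workspace(workspace)
            r.workspaces.append(w)
        if obj not in w.objects:
            w.objects.append(obj)


# ASSUMPTION: a toy prior replacing the real semantic relevance model.
TOY_PRIOR: Dict[Tuple[str, str], float] = {
    ("mug", "kitchen counter"): 0.9,
    ("mug", "desk"): 0.6,
    ("mug", "bed"): 0.1,
}


def toy_relevance(target: str, workspace: Workspace) -> float:
    """Hypothetical scorer: how likely `workspace` is to hold `target`."""
    return TOY_PRIOR.get((target, workspace.name), 0.0)


def rank_workspaces(
    graph: SceneGraph,
    target: str,
    relevance: Callable[[str, Workspace], float],
) -> List[Tuple[float, str, str]]:
    """Score every workspace (not every object) and sort by descending relevance."""
    scored = [
        (relevance(target, w), r.name, w.name)
        for r in graph.rooms
        for w in r.workspaces
    ]
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

In this sketch, a target such as "mug" that has not yet been observed can still direct exploration, because the ranking is computed over workspaces already in the graph rather than over detected objects. A robot would visit the highest-ranked workspace first.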