Robotic agents in home environments require scene representations that are both semantically expressive and adaptable to dynamic tasks. We present a relation-augmented, open-vocabulary 3D scene graph framework that combines hierarchical structure with keyframe-based vision-language reasoning. Starting from RGB-D sensing and segmentation-based object discovery, the system expands semantic relations using vision-language models (VLMs) and applies anomaly filtering to improve consistency. This enables fine-grained semantic connectivity and supports dynamic task understanding in open-world scenarios. Evaluations across three indoor scenes show strong performance in node labeling (87.9%), edge precision (84.5%), and instruction grounding (83.3%). Supplementary experiments with object relocations show that the graph is updated consistently under dynamic changes, supporting the framework’s effectiveness for robust task-level understanding in open-world robotic interaction.
Lu et al. studied this question.