Analyzing events in dynamic environments is a fundamental challenge in developing intelligent agents and robots that interact with humans. Current approaches rely predominantly on visual–text models; however, these methods often capture information from images only implicitly and lack interpretable, structured spatio-temporal representations of objects and their relationships. To address this issue, we introduce DyGEnc, a novel method for dynamic graph encoding. It integrates a compressed spatio-temporal representation with the cognitive capabilities of large language models to enable advanced question answering over sequences of textual scene graphs. Extensive evaluations on the STAR and AGQA datasets demonstrate that DyGEnc improves large language model performance on queries about the history of human–object interactions. Furthermore, the proposed method can be extended to process input images by leveraging foundation models to extract explicit textual scene graphs, as validated by the evaluation results. We expect these findings to contribute to the development of robust and compact graph-based memory for long-horizon reasoning in real-world applications, as demonstrated in a robotic experiment conducted on a wheeled manipulator platform.
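To make the phrase "sequences of textual scene graphs" concrete, the following is a minimal illustrative sketch of how per-frame scene graphs might be written out as text and paired with a question. It is not DyGEnc's actual input format or encoder: the triplet layout, the `serialize_scene_graphs` helper, and the sample frames are all assumptions introduced here for illustration.

```python
from typing import List, Tuple

# Assumed representation: one scene graph per frame, stored as
# (subject, relation, object) triplets. This layout is hypothetical.
Triplet = Tuple[str, str, str]


def serialize_scene_graphs(frames: List[List[Triplet]]) -> str:
    """Render a sequence of per-frame scene graphs as plain text."""
    lines = []
    for t, triplets in enumerate(frames):
        edges = "; ".join(f"{s} - {r} - {o}" for s, r, o in triplets)
        lines.append(f"Frame {t}: {edges}")
    return "\n".join(lines)


# Hypothetical example data, not taken from STAR or AGQA.
frames = [
    [("person", "holding", "cup"), ("cup", "on", "table")],
    [("person", "drinking from", "cup")],
]

text = serialize_scene_graphs(frames)
prompt = f"{text}\nQuestion: What did the person do after holding the cup?"
print(prompt)
```

A raw serialization like this grows linearly with the number of frames, which is the kind of cost a compressed graph encoding is meant to reduce.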
Sergey Linok
Independent University of Moscow
Vadim Semenov
Independent University of Moscow
Anastasia Trunova
Independent University of Moscow
Technologies
Innopolis University
Independent University of Moscow
Cognitive Research (United States)
Linok et al. (Sun,) studied this question.
synapsesocial.com/papers/69a7cc4cd48f933b5eed7dfc — DOI: https://doi.org/10.3390/technologies14030150