What does this research mean for the field?

DyGEnc improves large language model performance in answering questions about human–object interactions in dynamic scenes. Novelty: novel finding. Consensus alignment: neutral.

What question did this study set out to answer?

The study aims to enhance question answering in dynamic environments using a new method called DyGEnc.

March 4, 2026 · Open Access

DyGEnc: Encoding a Sequence of Textual Scene Graphs to Reason and Answer Questions in Dynamic Scenes

Key Points

Developed DyGEnc for encoding sequences of textual scene graphs.
Integrated spatio-temporal representations with large language models.
Evaluated DyGEnc on STAR and AGQA datasets to assess performance.
Extended method to process images using foundation models for scene graph extraction.
DyGEnc improves the performance of large language models in answering queries about human–object interactions.
Validation through robotic experiments demonstrates the method's real-world applicability.

Abstract

Analyzing events in dynamic environments poses a fundamental challenge in the development of intelligent agents and robots capable of interacting with humans. Current approaches predominantly rely on visual–text models; however, these methods often capture information implicitly from images, lacking interpretable and structured spatio-temporal object representations and their relationships. To address this issue, we introduce DyGEnc—a novel method for dynamic graph encoding. This method integrates compressed spatio-temporal representation with the cognitive capabilities of large language models. The purpose of this integration is to enable advanced question answering based on sequences of textual scene graphs. Extensive evaluations on the STAR and AGQA datasets demonstrate that DyGEnc improves large language model performance when addressing queries related to the history of human–object interactions. Furthermore, the proposed method can be extended to process input images by leveraging foundation models to extract explicit textual scene graphs, as validated by the evaluation results. We expect these findings to contribute to the development of robust and compact graph-based memory for long-horizon reasoning in real-world applications, as demonstrated in a robotic experiment conducted using a wheeled manipulator platform.
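
To make the phrase "sequence of textual scene graphs" concrete, the following is a minimal sketch of what such an input could look like when passed to a language model as plain text. It is not the DyGEnc encoder, which the abstract describes as producing a compressed spatio-temporal representation; it only illustrates the kind of structured input involved. The names `SceneGraph`, `serialize`, and `build_prompt`, as well as the triplet layout and the prompt wording, are assumptions made for illustration and do not come from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical layout: one relation is a (subject, relation, object) triplet.
Triplet = Tuple[str, str, str]


@dataclass
class SceneGraph:
    """One textual scene graph, observed at a single time step."""
    timestamp: int
    triplets: List[Triplet]


def serialize(graphs: List[SceneGraph]) -> str:
    """Render the graphs in time order, one line of facts per time step."""
    lines = []
    for graph in sorted(graphs, key=lambda g: g.timestamp):
        facts = "; ".join(f"{s} {r} {o}" for s, r, o in graph.triplets)
        lines.append(f"t={graph.timestamp}: {facts}")
    return "\n".join(lines)


def build_prompt(graphs: List[SceneGraph], question: str) -> str:
    """Combine the serialized graph history with a question (assumed wording)."""
    return f"Scene graphs:\n{serialize(graphs)}\nQuestion: {question}\nAnswer:"


# Example history of human-object interactions, given out of order on purpose.
history = [
    SceneGraph(1, [("person", "holding", "cup")]),
    SceneGraph(0, [("person", "next to", "table"), ("cup", "on", "table")]),
]
prompt = build_prompt(history, "What did the person pick up?")
```

Serializing every frame verbatim grows linearly with the length of the sequence, which is the cost a compressed graph encoding of the kind the abstract describes is meant to avoid.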

Authors

Sergey Linok

Independent University of Moscow

Vadim Semenov

Independent University of Moscow

Anastasia Trunova

Independent University of Moscow

Journals

Technologies

Institutions

Innopolis University

Independent University of Moscow

Cognitive Research (United States)
