Forecasting the future state of a scene is a key computer vision task for building systems capable of proactive perception and decision-making in changing environments. This work addresses scene graph forecasting: given a video and a sequence of past scene graphs, the model must predict the objects and their relations in subsequent frames. Unlike existing approaches, which are limited to static perception, the proposed method, GraphCast, takes into account both the semantic vision-language features of objects and their temporal dynamics. The architecture combines object-centric encoding with a transformer foundation model, interaction modeling via a biaffine relation classification head, and a specialized object presence classifier. In addition, a temporal convolution module extracts features and improves robustness to noise. Experiments on the STAR and Action Genome datasets demonstrate that the proposed architecture outperforms existing baselines.
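To make the biaffine relation classification head mentioned in the abstract more concrete, the following is a minimal NumPy sketch of a generic biaffine pair scorer. It is an illustration under stated assumptions, not the authors' implementation: the function name `biaffine_relation_scores`, the parameter names `U`, `W`, `b`, and all tensor shapes are hypothetical, and the abstract does not specify how GraphCast parameterizes its head. The sketch assumes the commonly used biaffine form, in which the score of relation r for a subject-object pair (i, j) is a bilinear term plus a linear term on the concatenated pair features plus a bias.

```python
import numpy as np


def biaffine_relation_scores(subj, obj, U, W, b):
    """Score every (subject, object) pair for every relation class.

    Hypothetical shapes (assumptions, not taken from the paper):
      subj: (N, d)   subject-role embeddings of N objects
      obj:  (N, d)   object-role embeddings of the same N objects
      U:    (R, d, d) bilinear weights, one matrix per relation class
      W:    (R, 2d)  linear weights on the concatenated pair features
      b:    (R,)     per-relation bias
    Returns an array of shape (N, N, R) with unnormalized scores.
    """
    n = subj.shape[0]
    # Bilinear term: subj[i]^T U[r] obj[j] for all i, j, r.
    bilinear = np.einsum("id,rde,je->ijr", subj, U, obj)
    # Linear term on the concatenation [subj[i]; obj[j]].
    pair = np.concatenate(
        [
            np.repeat(subj[:, None, :], n, axis=1),
            np.repeat(obj[None, :, :], n, axis=0),
        ],
        axis=-1,
    )  # shape (N, N, 2d)
    linear = np.einsum("ijk,rk->ijr", pair, W)
    return bilinear + linear + b


# Example with random inputs: 4 objects, 8-dim features, 3 relation classes.
rng = np.random.default_rng(0)
N, d, R = 4, 8, 3
subj = rng.normal(size=(N, d))
obj = rng.normal(size=(N, d))
U = rng.normal(size=(R, d, d))
W = rng.normal(size=(R, 2 * d))
b = rng.normal(size=(R,))
scores = biaffine_relation_scores(subj, obj, U, W, b)
```

In a forecasting setting, `scores[i, j]` would typically be passed through a softmax or sigmoid to obtain relation probabilities for the pair (i, j) in a future frame; which normalization GraphCast uses is not stated in the abstract.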
A. M. Trunova
D. A. Yudin
Doklady Mathematics
Moscow Institute of Physics and Technology
synapsesocial.com/papers/69a91cbed6127c7a504bfb29 — DOI: https://doi.org/10.1134/s1064562425700334