What type of study is this?

This is an experimental study.

September 30, 2025

Learning Prediction-aware Prior in Transformer Network for Accurate Spatio-Temporal Video Grounding

Key Points

PDTNet improves object localization accuracy in video grounding, leading to better spatio-temporal alignment.
The model integrates a prediction-aware Gaussian prior for precise object localization and tube construction.
By incorporating temporal priors, PDTNet enhances the alignment of spatial features with language descriptions.
Extensive experiments validate the effectiveness of the proposed spatio-temporal video grounding method.

Abstract

Spatio-temporal video grounding (STVG) aims to precisely locate a spatio-temporal tube in an untrimmed video corresponding to a given language description. Many existing methods decouple spatial and temporal grounding into separate tasks, missing the strong interdependencies between the two, which are crucial for accurately aligning spatial regions (such as objects) with their motion over time. To enhance spatio-temporal associations, we introduce a new Prior-Driven Transformer Network (PDTNet) that uses predicted temporal boundaries as priors to guide object bounding boxes for improved spatial grounding over time. First, PDTNet employs a temporal prior, termed the reference query, to enhance discriminability between language-related and language-irrelevant visual content, improving temporal boundary localization. Further, the context within the predicted temporal boundaries serves as additional prior knowledge to modulate spatial features. We also introduce a prediction-aware Gaussian prior for precise object localization, which ensures consistent tube construction and accurate object localization. Extensive experiments on STVG benchmarks validate the effectiveness of PDTNet. Code is available at https://github.com/tongzhang111/PDTNet.
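
The abstract does not spell out how the prediction-aware Gaussian prior is computed. As a rough illustration of the general idea only, the sketch below builds a 2D Gaussian weight map centered on a predicted box, which could then be used to emphasize spatial features near the prediction. The function name `gaussian_prior`, the normalized `(cx, cy, w, h)` box format, and the choice of standard deviations proportional to box size are all assumptions made for this example; they are not taken from PDTNet or its code.

```python
import math


def gaussian_prior(box, height, width, scale=1.0):
    """Return a height x width grid of weights that peak at the box center.

    box is (cx, cy, w, h) in normalized [0, 1] coordinates (an assumption
    for this sketch, not the paper's stated format).
    """
    cx, cy, w, h = box
    # Assumption: the spread along each axis is proportional to box size.
    sx = max(scale * w / 2, 1e-6)
    sy = max(scale * h / 2, 1e-6)
    grid = []
    for i in range(height):
        y = (i + 0.5) / height  # normalized center of row i
        row = []
        for j in range(width):
            x = (j + 0.5) / width  # normalized center of column j
            exponent = (x - cx) ** 2 / (2 * sx ** 2) + (y - cy) ** 2 / (2 * sy ** 2)
            row.append(math.exp(-exponent))
        grid.append(row)
    return grid


# Hypothetical predicted box in the middle of a 4 x 4 feature map.
prior = gaussian_prior((0.5, 0.5, 0.4, 0.2), height=4, width=4)
```

In a sketch like this, the weights would typically be multiplied into (or added in log form to) attention scores so that locations near the predicted box contribute more; whether PDTNet applies its prior this way is not stated in the abstract.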
