Spatio-temporal video grounding (STVG) aims to localize, in an untrimmed video, the spatio-temporal tube that corresponds to a given language description. Many existing methods treat spatial and temporal grounding as separate tasks and thereby miss the strong interdependencies between them, which are crucial for accurately aligning spatial regions (such as objects) with their motion over time. To strengthen these spatio-temporal associations, we introduce a Prior-Driven Transformer Network (PDTNet), which uses predicted temporal boundaries as priors to guide object bounding boxes for improved spatial grounding over time. First, PDTNet employs a temporal prior, termed the reference query, to enhance discriminability between language-related and language-irrelevant visual content, improving temporal boundary localization. Second, the context within the predicted temporal boundaries serves as a further prior that modulates spatial features. We also introduce a prediction-aware Gaussian prior for precise object localization. Together, these priors support consistent tube construction and accurate object localization. Extensive experiments on STVG benchmarks validate the effectiveness of PDTNet. Code is available at https://github.com/tongzhang111/PDTNet.
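To make the terms above concrete, the following is a minimal illustrative sketch and not the PDTNet implementation. It assumes a tube can be represented as the per-frame boxes that fall inside a predicted temporal segment, and that a Gaussian prior can be pictured as a weight map peaking at a predicted box center. The function names `build_tube` and `gaussian_prior`, the `scale` parameter, and the toy boxes are all hypothetical.

```python
import math


def build_tube(frame_boxes, start, end):
    """Keep the per-frame boxes inside the inclusive segment [start, end].

    frame_boxes maps a frame index to a box (x1, y1, x2, y2).
    Hypothetical helper: a tube is treated as boxes restricted to the segment.
    """
    return {t: box for t, box in sorted(frame_boxes.items()) if start <= t <= end}


def gaussian_prior(box, width, height, scale=0.5):
    """Return a height x width weight map peaking at the box center.

    Assumption: the standard deviation along each axis is `scale` times the
    box extent along that axis; the abstract does not specify this choice.
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    sx = max(scale * (x2 - x1), 1e-6)  # guard against zero-width boxes
    sy = max(scale * (y2 - y1), 1e-6)  # guard against zero-height boxes
    return [
        [
            math.exp(-0.5 * (((x - cx) / sx) ** 2 + ((y - cy) / sy) ** 2))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy per-frame boxes and an assumed predicted segment covering frames 1..2.
boxes = {0: (0, 0, 2, 2), 1: (1, 1, 3, 3), 2: (2, 2, 4, 4), 3: (3, 3, 5, 5)}
tube = build_tube(boxes, start=1, end=2)
prior = gaussian_prior(tube[1], width=5, height=5)
```

In this sketch the temporal segment decides which frames contribute boxes, and the weight map decays smoothly with distance from the box center, which is one simple way to picture how a temporal prediction can constrain spatial localization.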
Wang et al. (Mon,) studied this question.