Spatio-temporal video grounding aims to identify the spatial and temporal regions of a video that correspond to the objects or actions described in a given text. Current models for this task, however, often rely heavily on spatio-temporal priors when making predictions. As a result, they can pick up spurious correlations and generalize poorly to new or diverse scenarios. To overcome this limitation, we introduce a deconfounded multimodal learning framework that uses a structural causal model to treat dataset biases as a confounder and then remove their confounding effect. Within this framework, we perform causal intervention on the multimodal input and derive an unbiased estimation formula through do-calculus. To handle confounders that are diverse and often unobservable, we further propose a novel retrieval-based approach with a causal mask mechanism. The proposed method leverages analogical reasoning to facilitate deconfounded learning and mitigate dataset biases, enabling unbiased spatio-temporal prediction without explicitly modeling the confounding factors. Extensive experiments on two challenging benchmarks verify the effectiveness and rationality of the proposed solution.
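To make the idea of causal intervention concrete, the following is a minimal sketch of the standard backdoor adjustment on a toy discrete problem. It is not the paper's estimator: the abstract does not state the exact formula, and the proposed method avoids modeling confounders explicitly. The variables `Z` (a confounder standing in for dataset bias), `X` (the input), and `Y` (the prediction), along with every probability value below, are hypothetical and chosen only for illustration.

```python
# Toy illustration of backdoor adjustment (an assumption, not the paper's method).
# Z: binary confounder (e.g. a dataset bias), X: binary input, Y: binary outcome.
# All probabilities are made-up values for illustration.

p_z = {0: 0.5, 1: 0.5}                      # P(Z = z)
p_x1_given_z = {0: 0.2, 1: 0.8}             # P(X = 1 | Z = z)
p_y1_given_xz = {                           # P(Y = 1 | X = x, Z = z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary X."""
    return p_x1_given_z[z] if x == 1 else 1.0 - p_x1_given_z[z]


def observational(x):
    """P(Y = 1 | X = x): weights each stratum by the biased P(z | x)."""
    joint = {z: p_x_given_z(x, z) * p_z[z] for z in p_z}
    norm = sum(joint.values())
    return sum(p_y1_given_xz[(x, z)] * joint[z] / norm for z in p_z)


def interventional(x):
    """P(Y = 1 | do(X = x)) = sum_z P(Y = 1 | x, z) P(z) (backdoor adjustment)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


if __name__ == "__main__":
    obs_effect = observational(1) - observational(0)
    causal_effect = interventional(1) - interventional(0)
    print(f"observational difference:  {obs_effect:.2f}")
    print(f"interventional difference: {causal_effect:.2f}")
```

In this toy setting the observational difference overstates the effect of `X` on `Y` because `Z` influences both. Reweighting each stratum by `P(z)` instead of `P(z | x)` removes that spurious contribution, which is the kind of bias a deconfounded estimator is meant to eliminate.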
Wang et al. studied this question.