Abstract: Temporal Sentence Grounding in Video aims to locate the video segment that corresponds to an input textual query within a given video. The task is severely affected by biases, which significantly reduce generalization performance. However, existing debiasing methods remove only a limited amount of bias and do not consider the more significant multimodal bias. In this work, we verify the existence of multimodal bias through comparative experiments and propose a Dual-guided Multi-modal Bias Removal Strategy (DMBR) to address it. Building on the span-based natural language video localization paradigm, DMBR extracts salient text concepts (such as verbs, nouns, and numerals) and visual concepts (such as actions contained in the input video) to guide the generation of multimodal biases. A Language Guided Multi-modal Bias Generator and a Video Guided Multi-modal Bias Generator simulate all potential multimodal biases in a dual and complementary manner. We also introduce an adversarial training paradigm: the bias generators are expected to generate multimodal bias samples that can deceive the discriminator and the backbone network, the backbone network aims to produce correct predictions even in the presence of biased features, and the discriminator aims to accurately predict whether a sample contains bias. This strategy forces the backbone model to identify and remove the influence of multimodal biases, thus improving its robustness. We implement DMBR on multiple existing backbones and evaluate it on the widely used Charades-CD and ActivityNet-CD benchmarks; the results demonstrate the effectiveness of our debiasing strategy.
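The three competing roles in the adversarial paradigm can be illustrated with a toy calculation. The sketch below is not the paper's formulation: the function names (`bce`, `span_nll`, `adversarial_losses`), the choice of binary cross-entropy for the discriminator, and the unweighted sum of terms are all assumptions made for illustration. It only shows how a discriminator loss, a backbone loss, and a generator loss could pull in opposite directions for one clean/biased sample pair.

```python
import math

EPS = 1e-12  # numerical guard; an illustrative choice, not from the paper


def bce(p, label):
    """Binary cross-entropy for a single predicted probability p."""
    p = min(max(p, EPS), 1.0 - EPS)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def span_nll(start_probs, end_probs, start_idx, end_idx):
    """Negative log-likelihood of the ground-truth start and end boundaries,
    as is common in span-based localization (assumed here)."""
    return -(math.log(start_probs[start_idx] + EPS)
             + math.log(end_probs[end_idx] + EPS))


def adversarial_losses(d_clean, d_biased, nll_clean, nll_biased):
    """Return hypothetical (discriminator, backbone, generator) losses.

    d_clean / d_biased: discriminator's probability that the clean / biased
        sample contains bias.
    nll_clean / nll_biased: backbone span loss on clean / biased features.
    """
    # Discriminator: call clean samples 0 ("no bias") and biased samples 1.
    loss_d = bce(d_clean, 0) + bce(d_biased, 1)
    # Backbone: localize correctly with and without injected bias.
    loss_b = nll_clean + nll_biased
    # Generator: make biased samples look clean to the discriminator
    # and make the backbone's prediction on them worse.
    loss_g = bce(d_biased, 0) - nll_biased
    return loss_d, loss_b, loss_g


# Toy usage: a 3-clip video whose ground-truth span is clips 1..2.
nll = span_nll([0.1, 0.7, 0.2], [0.1, 0.2, 0.7], start_idx=1, end_idx=2)
loss_d, loss_b, loss_g = adversarial_losses(0.1, 0.9, nll, nll)
```

In an actual training loop these quantities would be minimized in alternation by separate optimizers over the respective networks. The sketch leaves out that loop, along with any loss weighting.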
Ruan et al. studied this question.