What type of study is this?

This is a quantitative study.

October 8, 2025 | Open Access

Contrastive Alignment with Semantic Gap-Aware Corrections in Text-Video Retrieval

Key Points

GARE improves retrieval accuracy by addressing the modality gap and the gradient conflicts that arise under the InfoNCE loss.
A learnable, pair-specific increment Delta_ij offloads optimization tension from the global anchor representation, which stabilizes alignment and improves retrieval performance.
Experiments show consistent improvement across four retrieval benchmarks, confirming the effectiveness of GARE.
A lightweight neural module, guided by gradient supervision, produces structure-aware corrections, and three regularizers on Delta further stabilize learning and promote interpretability.

Abstract

Recent advances in text-video retrieval have been largely driven by contrastive learning frameworks. However, existing methods overlook a key source of optimization tension: the separation between text and video distributions in the representation space (referred to as the modality gap), and the prevalence of false negatives in batch sampling. These factors lead to conflicting gradients under the InfoNCE loss, impeding stable alignment. To mitigate this, we propose GARE, a Gap-Aware Retrieval framework that introduces a learnable, pair-specific increment Delta_ij between text t_i and video v_j to offload the tension from the global anchor representation. We first derive the ideal form of Delta_ij via a coupled multivariate first-order Taylor approximation of the InfoNCE loss under a trust-region constraint, revealing it as a mechanism for resolving gradient conflicts by guiding updates along a locally optimal descent direction. Due to the high cost of directly computing Delta_ij, we introduce a lightweight neural module conditioned on the semantic gap between each video-text pair, enabling structure-aware correction guided by gradient supervision. To further stabilize learning and promote interpretability, we regularize Delta using three components: a trust-region constraint to prevent oscillation, a directional diversity term to promote semantic coverage, and an information bottleneck to limit redundancy. Experiments across four retrieval benchmarks show that GARE consistently improves alignment accuracy and robustness to noisy supervision, confirming the effectiveness of gap-aware tension mitigation.
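
The "ideal form" of Delta_ij described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation, and it rests on several assumptions that the abstract does not confirm: the increment is added to the text embedding before the dot product with the video embedding, the loss is a text-to-video InfoNCE with temperature tau, and the trust region is a single Frobenius-norm ball of radius eps over all increments of one text anchor. Under those assumptions, minimizing the first-order Taylor expansion of the loss inside the ball gives Delta = -eps * G / ||G||, where G is the gradient of the loss with respect to the increments. The function names `info_nce_t2v` and `first_order_delta` are hypothetical.

```python
import numpy as np


def info_nce_t2v(text, video, delta=None, tau=0.05):
    """Text-to-video InfoNCE with an optional pair-specific increment.

    text, video: (B, D) arrays of embeddings; pair (i, i) is the positive.
    delta:       (B, B, D) array, delta[i, j] shifts t_i when scored against v_j
                 (assumed form; the paper may inject Delta_ij differently).
    Returns the mean loss and the (B, B) softmax probabilities.
    """
    B, D = text.shape
    if delta is None:
        delta = np.zeros((B, B, D))
    shifted = text[:, None, :] + delta                       # (B, B, D)
    logits = np.einsum("ijd,jd->ij", shifted, video) / tau   # s_ij
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -np.mean(np.diag(log_prob))
    return loss, np.exp(log_prob)


def first_order_delta(probs, video, eps=0.1, tau=0.05):
    """Minimizer of the first-order Taylor expansion inside a trust region.

    For anchor i, dL_i / dDelta_ij = (p_ij - 1[i == j]) * v_j / tau.
    With the joint constraint ||Delta_i||_F <= eps (an assumption), the
    minimizer is the negative gradient rescaled to norm eps.
    """
    B = probs.shape[0]
    coeff = (probs - np.eye(B)) / tau                        # (B, B)
    grad = coeff[:, :, None] * video[None, :, :]             # (B, B, D)
    norm = np.sqrt((grad ** 2).sum(axis=(1, 2), keepdims=True))
    return -eps * grad / np.maximum(norm, 1e-12)


# Toy batch of random unit-norm embeddings (illustrative data only).
rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
video = rng.normal(size=(4, 8))
text /= np.linalg.norm(text, axis=1, keepdims=True)
video /= np.linalg.norm(video, axis=1, keepdims=True)

base_loss, probs = info_nce_t2v(text, video)
delta = first_order_delta(probs, video)
new_loss, _ = info_nce_t2v(text, video, delta)
print(f"loss without Delta: {base_loss:.4f}, with Delta: {new_loss:.4f}")
```

In this sketch the increment for the positive pair points toward its video, and the increments for negatives point away from theirs with magnitude proportional to their softmax probability, so the correction is absorbed by Delta_ij rather than by the shared anchor t_i. Computing it requires a full (B, B, D) gradient tensor per batch, which is consistent with the abstract's remark that direct computation is costly and motivates predicting Delta_ij with a lightweight module instead.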

Cite This Study

Xiao et al. studied this question.

synapsesocial.com/papers/68e6a0f4718ef0a556b34094 https://doi.org/10.48550/arxiv.2505.12499
