What question did this study set out to answer?

To develop a method for accurately retrieving videos that partially match textual descriptions despite modality gaps.

February 5, 2026

Bidirectional Cross-Modal Collaborative Alignment via Semantic-Guided Visual Embeddings for Partially Relevant Video Retrieval

Key points

Developed a bidirectional cross-modal alignment mechanism.
Constructed a semantic-visual association library storing paired visual and textual features.
Retrieved semantically similar visual samples dynamically during training to guide visual representation learning.
Achieved state-of-the-art performance on benchmark datasets for partially relevant video retrieval.
Demonstrated improved alignment between visual representations and textual semantics.

Abstract

Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that match a given textual query only partially. This task is inherently challenging due to the modality gap between text and video, which is further exacerbated by the partial semantic correspondence between linguistic descriptions and visual content. To address these challenges, we propose a bidirectional cross-modal alignment mechanism that collaboratively optimizes both visual and textual modalities. In the visual modality, a major difficulty lies in the absence of visual cues that directly correspond to textual semantics, which limits the model's ability to align visual representations with textual meanings under unsupervised conditions. To overcome this issue, we construct a semantic-visual association library, which stores paired visual and textual features with semantic annotations. During training, the model dynamically retrieves the most semantically similar visual samples from this library based on the current visual feature vector. These retrieved samples, preliminarily associated with semantics via cross-modal matching, are used to form dynamic anchors that guide visual representation learning. By leveraging these enriched visual features, the model progressively refines the visual representations to achieve better alignment with the corresponding textual inputs, thereby enhancing cross-modal consistency. In the textual modality, we enhance textual representations by integrating semantically aligned visual features selected from the same association library, further narrowing the modality gap. Extensive experiments on benchmark datasets under partial semantic correspondence scenarios demonstrate that our method achieves state-of-the-art performance. The source code is available at https://github.com/cyanlll/BOA.
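
The library lookup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes the library is a list of entries holding a visual feature, a paired textual feature, and a semantic label, that similarity is cosine similarity, and that a dynamic anchor is a softmax-weighted average of the top-k retrieved visual features. The names `retrieve_top_k`, `dynamic_anchor`, the entry keys, and the `temperature` parameter are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def softmax(scores, temperature=0.1):
    """Numerically stable softmax over a non-empty list of scores."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_top_k(query, library, k):
    """Return (score, entry) pairs for the k entries whose visual feature is closest to query."""
    scored = [(cosine(query, entry["visual"]), entry) for entry in library]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def dynamic_anchor(query, library, k=2, temperature=0.1):
    """Assumed fusion rule: similarity-weighted mean of the retrieved visual features."""
    top = retrieve_top_k(query, library, k)
    if not top:
        # Nothing to retrieve: fall back to the query itself.
        return list(query)
    weights = softmax([score for score, _ in top], temperature)
    anchor = [0.0] * len(query)
    for weight, (_, entry) in zip(weights, top):
        for i, value in enumerate(entry["visual"]):
            anchor[i] += weight * value
    return anchor


# Toy association library: hypothetical 2-D features with semantic labels.
LIBRARY = [
    {"visual": [1.0, 0.0], "text": [0.9, 0.1], "label": "dog runs"},
    {"visual": [0.0, 1.0], "text": [0.1, 0.9], "label": "cat sleeps"},
    {"visual": [0.9, 0.1], "text": [0.8, 0.2], "label": "dog jumps"},
]

current_visual_feature = [1.0, 0.0]
neighbours = retrieve_top_k(current_visual_feature, LIBRARY, k=2)
anchor = dynamic_anchor(current_visual_feature, LIBRARY, k=2)
print([entry["label"] for _, entry in neighbours], anchor)
```

In a training loop, an anchor of this kind would typically enter a loss that pulls the current visual feature toward it. The same lookup, keyed on a textual feature instead, could supply the visual features used to enrich the text side. Both uses are assumptions about how the pieces fit together, not details stated in the abstract.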
