Partially Relevant Video Retrieval (PRVR) aims to retrieve videos that are only partially relevant to a given textual query. This task is inherently challenging because of the modality gap between text and video, which is further exacerbated by the partial semantic correspondence between linguistic descriptions and visual content. To address these challenges, we propose a bidirectional cross-modal alignment mechanism that jointly optimizes the visual and textual modalities. In the visual modality, a major difficulty lies in the absence of visual cues that directly correspond to textual semantics, which limits the model's ability to align visual representations with textual meanings under unsupervised conditions. To overcome this issue, we construct a semantic-visual association library, which stores paired visual and textual features with semantic annotations. During training, the model dynamically retrieves the most semantically similar visual samples from this library based on the current visual feature vector. These retrieved samples, preliminarily associated with semantics via cross-modal matching, are used to form dynamic anchors that guide visual representation learning. By leveraging these enriched visual features, the model progressively refines the visual representations to achieve better alignment with the corresponding textual inputs, thereby enhancing cross-modal consistency. In the textual modality, we enhance textual representations by integrating semantically aligned visual features selected from the same association library, further narrowing the modality gap. Extensive experiments on benchmark datasets under partial semantic correspondence scenarios demonstrate that our method achieves state-of-the-art performance. The source code is available at https://github.com/cyanlll/BOA.
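The retrieval step described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that). It assumes cosine similarity for the library lookup, a fixed top-k selection, and a softmax-weighted average to form the anchor. The function name `retrieve_dynamic_anchor` and the parameters `k` and `temperature` are hypothetical and do not appear in the abstract.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def retrieve_dynamic_anchor(query, lib_visual, lib_text, k=3, temperature=0.1):
    """Look up the k library entries most similar to `query` and pool them.

    query:      (D,)   current visual feature vector
    lib_visual: (N, D) visual features stored in the association library
    lib_text:   (N, D) textual features paired row-by-row with lib_visual

    Assumptions (not stated in the abstract): cosine similarity, top-k
    selection, softmax weighting with a temperature.
    """
    # Cosine similarity between the query and every stored visual feature.
    sims = l2_normalize(lib_visual) @ l2_normalize(query)  # shape (N,)

    # Indices of the k most similar entries, best first.
    top_idx = np.argsort(-sims)[:k]

    # Numerically stable softmax over the selected similarities.
    logits = sims[top_idx] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Weighted averages act as the "dynamic anchors" in this sketch:
    # one in visual space, one built from the paired textual features.
    visual_anchor = weights @ lib_visual[top_idx]
    text_anchor = weights @ lib_text[top_idx]
    return top_idx, weights, visual_anchor, text_anchor


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    lib_v = rng.normal(size=(8, 4))  # toy library: 8 entries, 4-dim features
    lib_t = rng.normal(size=(8, 4))
    idx, w, v_anchor, t_anchor = retrieve_dynamic_anchor(lib_v[2], lib_v, lib_t)
    print("retrieved indices:", idx)
    print("weights:", w)
```

In a training loop, an anchor of this kind could serve as a target that the visual representation is pulled toward. The same lookup could supply visual features to fuse into the textual representation. How the paper actually combines them is not specified in the abstract.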
Li et al. (Thu,) studied this question.