Cross-modal retrieval enables more flexible information access and improves semantic understanding across modalities. Traditional cross-modal retrieval models, however, rely on well-aligned datasets, which are labor-intensive and costly to obtain. In real-world applications, data inevitably includes mismatched pairs, and these semantically inconsistent pairs can significantly degrade retrieval performance. Previous approaches estimate soft labels under the assumption that loss values follow an ideal distribution, and use those labels to optimize models for accurate semantic matching. Because they do not learn hierarchical semantic correlations, these models are less effective when pairs are only partially mismatched. To address these challenges, we propose Exploring Hierarchical Cross-Modal Correlation Consistency (EH3C) for cross-modal retrieval under partially mismatched conditions. Specifically, our approach first leverages neighborhood correlation distributions among samples to optimize cross-modal alignment, without assuming ideal distributions. This makes it possible to measure the soft matching degree of each cross-modal data pair and to learn positive correlations between pairs effectively. Next, we enhance inter-class separability through intra-modal correlation learning by exploiting negative correlations between reliable negative sample pairs, thus enabling a more comprehensive exploration of cross-modal correlations. Finally, to assess the effectiveness and robustness of our approach, we conduct extensive experiments on three benchmark datasets. The results demonstrate that the proposed EH3C significantly improves cross-modal retrieval performance in scenarios involving partial mismatches.
Liu et al. studied this question.
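To make the idea of a neighborhood-based soft matching degree more concrete, the following is a minimal sketch and not the EH3C formulation, which the abstract does not specify. It assumes that each sample's "neighborhood correlation distribution" can be approximated by a softmax over its cosine similarities to the other samples in the same modality, and that the soft matching degree of a pair can be scored by the Bhattacharyya coefficient between the image-side and text-side distributions. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def softmax(scores, temperature):
    """Numerically stable softmax with a temperature."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_distribution(embeddings, i, temperature=0.1):
    """Distribution of sample i's similarity to all other samples in its modality."""
    sims = [cosine(embeddings[i], e) for j, e in enumerate(embeddings) if j != i]
    return softmax(sims, temperature)


def soft_matching_degrees(image_emb, text_emb, temperature=0.1):
    """Score each (image, text) pair by how consistent its two neighborhoods are.

    Assumption: the Bhattacharyya coefficient (in [0, 1]) stands in for the
    consistency measure; no ideal loss distribution is assumed.
    """
    degrees = []
    for i in range(len(image_emb)):
        p = neighborhood_distribution(image_emb, i, temperature)
        q = neighborhood_distribution(text_emb, i, temperature)
        degrees.append(sum(math.sqrt(a * b) for a, b in zip(p, q)))
    return degrees


# Hypothetical toy data: two semantic clusters, four pairs.
# The text of the last pair is deliberately drawn from the wrong cluster.
images = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
texts = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [1.0, 0.05]]

degrees = soft_matching_degrees(images, texts)
for index, degree in enumerate(degrees):
    print(f"pair {index}: soft matching degree = {degree:.3f}")
```

In this toy setting the deliberately mismatched pair receives the lowest score, because its image and its text sit in different neighborhoods. With only four samples, the single bad text also perturbs the neighborhoods of the clean pairs, so their scores fall below 1. A larger, mostly clean dataset would dilute that effect.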