Cross-modal hashing retrieval is fundamentally challenged by the modality-modality (M-M) and modality-label (M-L) inconsistencies inherent in multimodal data. Existing methods address these inconsistencies with coarse-grained disentanglement, but they separate semantics inaccurately and lose modality-common semantic information during cross-modal alignment. Our analysis shows that coarse-grained approaches fail to alleviate modality inconsistencies effectively. In validation experiments, incorporating fine-grained features improves accuracy by up to 6\% over coarse-grained methods, which indicates that fine-grained semantic components are critical for robust cross-modal retrieval. However, existing fine-grained methods require extensive pre-training and do not integrate seamlessly into end-to-end frameworks. In this paper, we propose Inconsistency Alleviated Fine-Grained (IAFG) cross-modal hashing retrieval, a novel framework that performs disentanglement and alignment at the level of semantic components without extensive pre-training. Our approach introduces two key innovations. Semantic Component Disentanglement (SCD) separates modality-common and modality-unique information at a fine-grained level using learnable query vectors and competitive feature routing. Fine-grained Semantic Alignment (FSA) aligns modalities accurately at the component level while preserving semantic details, using component-level cross-attention and cross-modal triplet alignment. Extensive experiments on benchmark datasets demonstrate that our method achieves state-of-the-art performance, with significant improvements in retrieval accuracy across different modalities.