With the rapid growth of internet multimedia data, cross-modal retrieval techniques have garnered significant attention. Because cross-modal relationships are inherently complex and non-intuitive, tuning pre-trained Large Multimodal Models (LMMs) with cross-modal data has become a mainstream approach. However, cross-modal data commonly exhibit inter-modal information asymmetry and intra-modal distribution diversity. Faced with these challenges, existing paradigms tend to learn ambiguous and asymmetric cross-modal associations, which introduce semantic noise. In addition, their limited adaptability to the high diversity of real-world content further hinders retrieval performance. To address these challenges, this paper proposes the Adaptive Co-operative Knowledge Enhancement (ACKE) method, which comprises the Uncertainty-Aware Inspire Potential (UAIP) and Adaptive Co-operative Prompt (ACP) strategies. UAIP uses generative LMMs to produce multi-perspective descriptions that enrich semantic information, and employs Dempster-Shafer Theory (DST) to quantify the semantic uncertainty of these descriptions and adjust their contribution weights, reducing inaccurate relational mappings and balancing information asymmetry. ACP constructs a prompt pool from which instance-specific visual prompts are dynamically selected and projected into text prompts; the visual and text prompts then collaborate to guide the modal encoders toward deep semantic consensus, mitigating the alignment bias caused by intra-modal distribution diversity and improving accuracy. Extensive experiments on two widely used datasets, Flickr30K and MS-COCO, demonstrate the effectiveness of the proposed method. The code is available at https://github.com/nynu-BDAI/ACKE.
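The abstract does not give UAIP's formulas, so the following is only a minimal sketch of one common DST-style formulation (the subjective-logic mapping used in evidential learning), not the authors' implementation. It assumes each generated description comes with a non-negative evidence vector over K hypotheses; belief masses are b_k = e_k / S and the uncertainty mass is u = K / S, with S = sum(e_k + 1). The function names `uncertainty_from_evidence` and `weight_descriptions`, and the choice to weight each description by its confidence 1 - u, are hypothetical and chosen for illustration.

```python
def uncertainty_from_evidence(evidence):
    """Map non-negative evidence to belief masses and an uncertainty mass.

    Assumed formulation (not taken from the paper): S = sum(e_k + 1),
    b_k = e_k / S, u = K / S, so that sum(b_k) + u == 1.
    """
    k = len(evidence)
    if k == 0:
        raise ValueError("evidence must contain at least one hypothesis")
    if any(e < 0 for e in evidence):
        raise ValueError("evidence values must be non-negative")
    strength = sum(evidence) + k  # Dirichlet strength S
    beliefs = [e / strength for e in evidence]
    uncertainty = k / strength
    return beliefs, uncertainty


def weight_descriptions(evidence_per_description):
    """Turn per-description uncertainty into normalized contribution weights.

    Hypothetical weighting rule: each description contributes in proportion
    to its confidence (1 - u). Falls back to uniform weights when every
    description is fully uncertain.
    """
    confidences = [
        1.0 - uncertainty_from_evidence(ev)[1] for ev in evidence_per_description
    ]
    total = sum(confidences)
    if total == 0:
        n = len(confidences)
        return [1.0 / n] * n
    return [c / total for c in confidences]


if __name__ == "__main__":
    # Two generated descriptions: one with strong evidence, one with weak evidence.
    weights = weight_descriptions([[8.0, 0.0], [2.0, 0.0]])
    print(weights)
```

Under this assumed rule, a description backed by little evidence receives a high uncertainty mass and therefore a small weight, which matches the abstract's stated goal of down-weighting unreliable descriptions.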
Huang et al. (Wed,) studied this question.