Medical vision-language representation learning has garnered considerable attention owing to its ability to extract generic representations from the image and text modalities. However, it remains challenging to acquire a comprehensive understanding of intra- and inter-modal semantic knowledge. In this paper, we propose a Cross-Modal Multi-Teacher Contrastive Distillation (CMCD) architecture, which learns medical vision-language representations in a unified multi-teacher framework. Specifically, a cross-modal knowledge distillation (CKD) module refines the reconstructed semantics under an additional supervision signal generated by momentum teachers from the other modality, achieving more robust semantic interaction across modalities. To alleviate the heterogeneity and semantic gaps between modalities, a multi-level contrastive learning (MCL) module aligns both intra- and inter-modal features via contrastive learning at multiple levels. Extensive experiments on two medical downstream tasks, i.e., Med-VQA and Med-ITC, demonstrate that CMCD consistently outperforms state-of-the-art methods.
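The abstract does not spell out the update rule or the loss, so the following is only a minimal sketch of two generic building blocks that momentum-teacher distillation and image-text contrastive learning commonly rely on: an exponential-moving-average (EMA) teacher update and a symmetric InfoNCE loss. It is not the CMCD implementation. The function names, the momentum value 0.99, and the temperature 0.07 are all assumptions chosen for illustration.

```python
import math

# Hypothetical helper names; illustrative only, not taken from the paper.

def ema_update(teacher, student, momentum=0.99):
    """EMA teacher update, elementwise: t <- m * t + (1 - m) * s."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean InfoNCE loss; queries[i] and keys[i] are the positive pair."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarity of query i against every key, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over all keys.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Average of the image-to-text and text-to-image InfoNCE losses."""
    return 0.5 * (info_nce(image_emb, text_emb, temperature)
                  + info_nce(text_emb, image_emb, temperature))
```

In this sketch, matched image and text embeddings yield a loss near zero, while mismatched pairs are penalized in proportion to the inverse temperature. How CMCD combines such terms across levels and teachers is not stated in the abstract.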
Chen et al. (Mon,) studied this question.