March 18, 2024 · Open Access

Medical Vision-Language Representation Learning with Cross-Modal Multi-Teacher Contrastive Distillation

Abstract

Medical vision-language representation learning has garnered considerable attention owing to its applicability to extracting generic representations from the image and text modalities. However, it remains challenging to acquire a comprehensive understanding of intra- and inter-modal semantic knowledge. In this paper, we propose a Cross-Modal Multi-Teacher Contrastive Distillation (CMCD) architecture, which aims to learn medical vision-language representations comprehensively in a unified multi-teacher framework. Specifically, a cross-modal knowledge distillation (CKD) module is designed to refine reconstructed semantics under an additional supervision signal generated by momentum teachers from the other modality, achieving more robust semantic interaction across modalities. To alleviate the heterogeneity and semantic gaps between modalities, a multi-level contrastive learning (MCL) module is designed to align both intra- and inter-modal features via contrastive learning at multiple levels. Extensive experiments on two medical downstream tasks, i.e., Med-VQA and Med-ITC, demonstrate that CMCD consistently outperforms state-of-the-art methods.
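The abstract names two generic building blocks, momentum teachers and contrastive alignment across modalities, without giving their exact form. The sketch below illustrates the standard versions of both under stated assumptions: an exponential-moving-average (EMA) teacher update and a symmetric InfoNCE loss over paired image and text features. It is not the authors' implementation; the function names, the momentum value, and the temperature are hypothetical choices for illustration only.

```python
import math


def ema_update(teacher, student, momentum=0.99):
    """Momentum (EMA) update: teacher <- m * teacher + (1 - m) * student.

    Assumption: parameters are flat lists of floats; `momentum` is illustrative.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy in which queries[i] should match keys[i].

    All other keys in the batch act as negatives.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE terms."""
    return 0.5 * (
        info_nce(image_feats, text_feats, temperature)
        + info_nce(text_feats, image_feats, temperature)
    )
```

In this sketch, correctly paired image and text features yield a loss near zero, while mismatched pairs yield a large loss. In a distillation setup of the kind the abstract describes, the keys for one modality would plausibly come from the momentum teacher of the other modality, but that wiring is an assumption here.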

Cite This Study

Chen et al. (2024) studied this question.

synapsesocial.com/papers/68e7397eb6db6435876b2a45 https://doi.org/10.1109/icassp48485.2024.10447344
