What question did this study set out to answer?

The research aims to address the modality gap in vision-language models for healthcare by enhancing cross-modal alignment.

April 17, 2026

Bridging the Modality Gap in Medical Vision-Language Models: A Hybrid Contrastive-Optimal Transport Framework for Enhanced Cross-Modal Alignment.

Key Points

The research aims to address the modality gap in vision-language models for healthcare by enhancing cross-modal alignment.
Proposed a combination of contrastive learning and entropy-regularized optimal transport.
Introduced a medical condition-driven association strategy to define positive pairs.
Developed an intra-modality negative sampling scheme to reduce intra-modal contrastive pressure.
Implemented a lightweight embedding refinement network for better embedding organization.
Achieved a significant reduction in the modality gap, improving alignment scores from 0.33 to 0.73.
Increased retrieval precision by 22% to 33%.
Enhanced zero-shot classification accuracy by 13% to 48%.
Reduced clustering dispersion metrics by a factor of 4.27 across standard benchmarks.
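
As an illustration of the condition-driven association strategy listed above, the following is a minimal sketch, not the authors' implementation. It assumes each image and each report carries a set of condition labels, treats any image-report pair sharing at least one label as positive, and averages an image-to-text InfoNCE-style loss over all positives. The function names (positive_mask, multi_positive_loss), the label format, and the temperature value are hypothetical choices made for this example; the intra-modality negative sampling scheme is not shown.

```python
import math


def positive_mask(image_labels, text_labels):
    """mask[i][j] is True when image i and report j share at least one condition.

    Assumption: labels are given as Python sets of condition names.
    """
    return [[bool(set(a) & set(b)) for b in text_labels] for a in image_labels]


def _dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def multi_positive_loss(img_emb, txt_emb, mask, temperature=0.07):
    """Image-to-text contrastive loss averaged over every positive pair.

    Hypothetical illustration: the temperature and the averaging scheme
    are assumptions, not values taken from the study.
    """
    img = [_normalize(v) for v in img_emb]
    txt = [_normalize(v) for v in txt_emb]
    total, count = 0.0, 0
    for i, u in enumerate(img):
        logits = [_dot(u, t) / temperature for t in txt]
        top = max(logits)
        # Numerically stable log-sum-exp over all candidate reports.
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        for j, is_positive in enumerate(mask[i]):
            if is_positive:
                total += log_denom - logits[j]
                count += 1
    if count == 0:
        raise ValueError("mask contains no positive pairs")
    return total / count


# Two images and two reports; pairs are positive when they share a pathology.
mask = positive_mask(
    [{"pneumonia"}, {"edema"}],
    [{"pneumonia", "effusion"}, {"edema"}],
)
loss = multi_positive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]], mask)
```

Because positives come from shared labels rather than from the original image-report pairing, one image can have several positive reports, which is why the loss averages over all of them instead of picking a single target.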

Abstract

Vision-language models in healthcare face a critical limitation: the modality gap, in which image and text embeddings occupy widely separated regions of the shared representation space. This gap is reinforced by traditional contrastive learning objectives and constrains both cross-modal understanding and downstream task performance. Existing approaches focus on input-level requirements; however, the geometric constraints imposed by multimodal contrastive learning remain largely unexplored. We propose a novel framework that combines contrastive learning with entropy-regularized optimal transport for medical modality alignment, tackling instance-level and distribution-level alignment simultaneously. First, a medical condition-driven association strategy is introduced, which defines positive pairs through shared pathologies rather than rigid image-text correspondence. Next, an intra-modality negative sampling scheme is designed, which constrains intra-modal contrastive pressure to prevent reinforcement of cross-modal separation. These operate in tandem with a lightweight embedding refinement network, which reshapes pretrained BiomedCLIP embeddings into diagnosis-aware clusters, supporting compatibility with clinical pipelines. The approach substantially reduces the modality gap, as shown by higher alignment scores (0.33-0.73), and improves retrieval precision (22%-33%) and zero-shot classification accuracy (13%-48%), with a 4.27-fold reduction in clustering dispersion metrics on standard benchmarks (CheXpert 200×5, MIMIC 200×5, RSNA, and COVID).
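
To make the distribution-level component concrete, here is a minimal sketch of entropy-regularized optimal transport solved with Sinkhorn iterations. It is an illustration under stated assumptions, not the study's implementation: the cosine cost, the uniform marginals, the regularization strength epsilon, and the iteration count are all choices made for this example, and the helper names (cosine_cost, sinkhorn, transport_cost) are hypothetical.

```python
import math


def cosine_cost(xs, ys):
    """Pairwise cost 1 - cosine similarity between two sets of vectors."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    xs_u = [unit(x) for x in xs]
    ys_u = [unit(y) for y in ys]
    return [[1.0 - sum(a * b for a, b in zip(x, y)) for y in ys_u] for x in xs_u]


def sinkhorn(cost, epsilon=0.1, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals.

    Assumption: plain (non-log-domain) Sinkhorn, which can underflow
    when epsilon is very small relative to the costs.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass on image embeddings
    b = [1.0 / m] * m  # uniform mass on text embeddings
    kernel = [[math.exp(-c / epsilon) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def transport_cost(plan, cost):
    """Total cost of moving mass according to the plan."""
    return sum(p * c for p_row, c_row in zip(plan, cost) for p, c in zip(p_row, c_row))


# Toy embeddings standing in for image and text features.
image_emb = [[1.0, 0.0], [0.0, 1.0]]
text_emb = [[0.9, 0.1], [0.1, 0.9]]
cost = cosine_cost(image_emb, text_emb)
plan = sinkhorn(cost)
ot_loss = transport_cost(plan, cost)
```

In a training setup, a quantity like ot_loss could be added to an instance-level contrastive loss so that the two embedding distributions are pulled together as a whole, rather than only pair by pair.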
