Vision-language models in healthcare face a critical limitation: the modality gap, in which image and text embeddings occupy widely separated regions of the shared representation space. The gap is reinforced by traditional contrastive learning objectives and constrains both cross-modal understanding and downstream task performance. Existing approaches focus on input-level requirements, whereas the geometric constraints imposed by multimodal contrastive learning remain largely unexplored. We propose a novel framework that combines contrastive learning with entropy-regularized optimal transport for medical modality alignment, addressing instance-level and distribution-level alignment simultaneously. First, a medical condition-driven association strategy defines positive pairs through shared pathologies rather than rigid image-text correspondence. Next, an intra-modality negative sampling scheme constrains intra-modal contrastive pressure to prevent it from reinforcing cross-modal separation. These components operate in tandem with a lightweight embedding refinement network that reshapes pretrained BiomedCLIP embeddings into diagnosis-aware clusters, supporting compatibility with clinical pipelines. The approach significantly reduces the modality gap, as shown by increases in alignment scores (0.33-0.73), and improves retrieval precision (22%-33%) and zero-shot classification accuracy (13%-48%), with a 4.27-fold reduction in clustering dispersion metrics on standard benchmarks (CheXpert 200×5, MIMIC 200×5, RSNA, and COVID).
Lahmar et al. studied this question.
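To make the quantities named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. It illustrates three ideas under stated assumptions: the modality gap measured as the distance between the centroids of normalized image and text embeddings (one common proxy; the abstract does not say how the gap is measured), an entropy-regularized optimal transport plan computed with Sinkhorn iterations over a cosine cost with uniform marginals, and an InfoNCE-style loss in which positives are defined by a shared condition label instead of exact image-text pairing. All function names, the temperature of 0.07, and the regularization strength `eps` are hypothetical choices made for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit Euclidean norm."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def modality_gap(img, txt):
    """Assumed gap proxy: distance between the centroids of the two
    L2-normalized embedding sets."""
    return float(np.linalg.norm(l2_normalize(img).mean(0) - l2_normalize(txt).mean(0)))

def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropy-regularized OT plan between uniform marginals (Sinkhorn)."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-cost / eps)           # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(n_iters):          # alternate row / column scaling
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]

def condition_contrastive_loss(img, txt, img_labels, txt_labels, temperature=0.07):
    """Image-to-text InfoNCE-style loss; a text counts as a positive for an
    image whenever both carry the same condition label (assumed form).
    Every image needs at least one text with a matching label."""
    img, txt = l2_normalize(img), l2_normalize(txt)
    logits = img @ txt.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    pos = (img_labels[:, None] == txt_labels[None, :]).astype(float)
    per_image = -(pos * log_prob).sum(axis=1) / pos.sum(axis=1)
    return float(per_image.mean())

# Toy data: 3 conditions, 4 samples each, with an artificial modality offset.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(3), 4)
centers = rng.normal(size=(3, 16))
offset = np.ones(16)                  # pushes the two modalities apart
img = centers[labels] + 0.1 * rng.normal(size=(12, 16)) + offset
txt = centers[labels] + 0.1 * rng.normal(size=(12, 16)) - offset

cost = 1.0 - l2_normalize(img) @ l2_normalize(txt).T      # cosine cost
plan = sinkhorn_plan(cost)
print("modality gap:", modality_gap(img, txt))
print("OT alignment cost:", float((plan * cost).sum()))
print("condition-driven loss:", condition_contrastive_loss(img, txt, labels, labels))
```

In a training setup along the lines the abstract describes, the transport cost would act as a distribution-level term alongside the instance-level contrastive term; how the two are weighted is not stated in the abstract and is left open here.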