June 23, 2026 | Open Access

Fine-grained alignment in medical pathology vision-language models via variational distillation

Key Points

Aimed to enhance fine-grained semantic alignment in medical pathology vision-language (V-L) models.
Proposed a Variational Distillation framework for Medical Pathology V-L models.
Implemented a dual-loop optimization mechanism to distill and align semantic signals from text and external LLM knowledge.
Used variational latent distributions and KL-based loss for modeling semantic ambiguity.
Outperformed existing methods across five pathology benchmarks in class generalization, few-shot learning, and cross-organ transfer.
Improved sensitivity to subtle semantic variations crucial for pathology image understanding.
Achieved more precise and pathology-aware image-text matching.

Abstract

Pre-trained vision-language (V-L) models exhibit promising performance across various general-domain tasks. However, they fall short in medical pathology, which requires fine-grained semantic alignment to distinguish subtle visual patterns across categories. This limitation is not merely due to domain gaps but stems from the inability to capture detailed, pathology-specific semantics. Previous efforts leveraging large language models (LLMs) or cross-modal training often introduce redundant or ambiguous cues, ultimately weakening generalization. To explicitly enhance fine-grained alignment, we propose a Variational Distillation framework tailored for Medical Pathology V-L models. This method introduces a dual-loop optimization mechanism that jointly distills and aligns semantic signals from both textual inputs and external LLM knowledge. Specifically, we use variational latent distributions to model semantic ambiguity and apply a KL-based loss to reduce the discrepancy between the two signals. This encourages the model to retain robust and generalizable features, enabling improved sensitivity to subtle semantic variations critical in pathology image understanding. During cross-modal alignment, the proposed method further amplifies modality-shared semantics while suppressing modality-specific noise and task-irrelevant factors, yielding more precise and pathology-aware image-text matching. Extensive experiments on five pathology benchmarks across three settings (class generalization, few-shot learning, and cross-organ transfer) demonstrate that the proposed method consistently outperforms existing approaches.
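
To make the KL-based loss mentioned in the abstract more concrete, the following is a minimal sketch, not the authors' implementation. It assumes each semantic signal (one from the text input, one from external LLM knowledge) is encoded as a diagonal Gaussian latent distribution given by a mean vector and a log-variance vector; the abstract does not state the actual parameterization. The function name `kl_diag_gaussians` and the toy values are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(p || q) between two diagonal Gaussians.

    Assumption: each latent distribution is a diagonal Gaussian described
    by per-dimension means and log-variances (hypothetical parameterization).
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        var_p = math.exp(lp)
        var_q = math.exp(lq)
        # Per-dimension term: 0.5 * (log(var_q / var_p)
        #                            + (var_p + (mp - mq)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (var_p + (mp - mq) ** 2) / var_q - 1.0)
    return total


# Toy latents (hypothetical values): a text-derived and an LLM-derived
# distribution with unit variance that differ only in the first mean.
text_mu, text_logvar = [0.0, 0.0], [0.0, 0.0]
llm_mu, llm_logvar = [1.0, 0.0], [0.0, 0.0]

loss = kl_diag_gaussians(text_mu, text_logvar, llm_mu, llm_logvar)
print(loss)
```

Minimizing a term of this kind pulls the two latent distributions toward each other, and the value is zero when they coincide. How the paper weights this term or combines it with the dual-loop optimization is not specified in the abstract.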
