Pre-trained vision-language (V-L) models perform well across a wide range of general-domain tasks. However, they fall short in medical pathology, where fine-grained semantic alignment is essential for distinguishing subtle visual patterns across categories. This limitation stems not merely from the domain gap but from an inability to capture detailed, pathology-specific semantics. Previous efforts that leverage large language models (LLMs) or cross-modal training often introduce redundant or ambiguous cues, which ultimately weakens generalization. To explicitly enhance fine-grained alignment, we propose a Variational Distillation framework tailored to medical pathology V-L models. The method introduces a dual-loop optimization mechanism that jointly distills and aligns semantic signals from both textual inputs and external LLM knowledge. Specifically, we model semantic ambiguity with variational latent distributions and apply a KL-based loss to reduce the discrepancy between these signals. This encourages the model to retain robust, generalizable features and improves its sensitivity to the subtle semantic variations that are critical in pathology image understanding. During cross-modal alignment, the method further amplifies modality-shared semantics while suppressing modality-specific noise and task-irrelevant factors, yielding more precise, pathology-aware image-text matching. Extensive experiments on five pathology benchmarks across three settings (class generalization, few-shot learning, and cross-organ transfer) demonstrate that the proposed method consistently outperforms existing approaches.
Huang et al. (Mon,) studied this question.
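To make the KL-based loss over variational latent distributions concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each semantic signal (for example, one derived from the text input and one from LLM knowledge) is summarized as a diagonal Gaussian with a mean and a log-variance per latent dimension; the abstract does not state the distribution family, and the function name `kl_diag_gaussians` and all variable names are hypothetical.

```python
import math


def kl_diag_gaussians(mu_p, logvar_p, mu_q, logvar_q):
    """Closed-form KL(p || q) between two diagonal Gaussians.

    Each argument is a sequence of floats, one entry per latent dimension.
    Assumption: latents are diagonal Gaussians parameterized by mean and
    log-variance; the source abstract does not specify this.
    """
    total = 0.0
    for mp, lp, mq, lq in zip(mu_p, logvar_p, mu_q, logvar_q):
        var_p = math.exp(lp)
        var_q = math.exp(lq)
        # Per-dimension term: 0.5 * (log(var_q / var_p)
        #                            + (var_p + (mp - mq)^2) / var_q - 1)
        total += 0.5 * (lq - lp + (var_p + (mp - mq) ** 2) / var_q - 1.0)
    return total


# Hypothetical latents for a text-derived and an LLM-derived signal.
text_mu, text_logvar = [0.0, 0.0], [0.0, 0.0]
llm_mu, llm_logvar = [1.0, 0.0], [0.0, 0.0]

loss = kl_diag_gaussians(text_mu, text_logvar, llm_mu, llm_logvar)
print(loss)  # 0.5
```

Minimizing a term of this kind pulls the two latent distributions toward each other; it is zero only when their means and variances coincide. How such a term would be weighted and combined with the dual-loop optimization is not specified in the abstract and is left out of this sketch.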