While deep learning models have achieved remarkable diagnostic accuracy in medical imaging, their "black box" nature impedes clinical adoption because clinicians cannot inspect, and therefore cannot trust, their reasoning. Current explainable AI (XAI) methods, such as saliency maps, offer low-level feature attribution but fail to provide clinically meaningful reasoning. State-of-the-art vision-language models trained end-to-end on image-report pairs often exploit superficial correlations in noisy data, generating plausible but clinically vacuous explanations. To address this gap, we propose K-Distill-XAI, a teacher-student framework that decouples visual feature learning from high-level clinical reasoning. We first train a domain-expert "teacher" large language model (LLM) on a large corpus of biomedical literature to generate canonical, text-based clinical rationales. We then train a multimodal "student" vision-language model with a cross-modal knowledge distillation objective that compels it to generate explanations semantically aligned with the teacher's expert reasoning. In experiments on the public MIMIC-CXR dataset, K-Distill-XAI significantly outperforms state-of-the-art baselines in clinical accuracy, achieving an 8% relative improvement in CheXbert F1 score for report generation. The distillation process also improves classification performance, yielding state-of-the-art micro-averaged AUC across 14 clinical conditions.
Fu et al. studied this question.
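The abstract does not give the exact form of the cross-modal knowledge distillation objective, so the following is only a minimal sketch of one plausible shape such an objective could take. It assumes that the student's generated explanation and the teacher's rationale have each been mapped to a fixed-size embedding by some text encoder, and it combines a standard report-generation loss with a cosine-alignment term. The function names, the cosine form of the alignment term, and the weight `alpha` are all hypothetical illustrations, not the method of K-Distill-XAI.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def generation_loss(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood the student assigns to the reference report tokens."""
    eps = 1e-12  # guards against log(0)
    return -sum(math.log(max(p, eps)) for p in token_probs) / len(token_probs)


def distillation_loss(
    student_emb: Sequence[float],
    teacher_emb: Sequence[float],
    token_probs: Sequence[float],
    alpha: float = 0.5,  # hypothetical weight on the alignment term
) -> float:
    """Generation loss plus a penalty for semantic misalignment with the teacher.

    The alignment term is 1 - cos(student, teacher): 0 when the two
    explanation embeddings point the same way, 2 when they are opposite.
    This form is an assumption made for illustration only.
    """
    alignment = 1.0 - cosine_similarity(student_emb, teacher_emb)
    return generation_loss(token_probs) + alpha * alignment
```

In this sketch the teacher embedding is treated as a fixed target, which mirrors the decoupling described in the abstract: the teacher supplies the clinical reasoning, and only the student is optimized to match it while also learning to generate the report.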