Significant progress has been made in applying deep learning to the automatic diagnosis of skin lesions. However, most models remain unexplainable, which severely hinders their application in clinical settings. Concept-based ante-hoc interpretable models have the potential to clarify the diagnostic decision-making process by learning high-level, human-understandable concepts, but they can provide only numerical values of concept contributions. Pre-trained Vision-Language Models (VLMs) can learn rich vision-language correlations from large-scale image-text pairs, and fine-tuning them for specific downstream tasks is an effective way to reduce data requirements. Nevertheless, when there is a substantial disparity between the pre-trained model and the target task, existing tuning methods frequently struggle to generalize and require substantial training data to fully adapt VLMs to specialized medical tasks. In this work, we propose a concept adaptive fine-tuning (CptAFT) method based on the pre-trained VLM BiomedCLIP to develop a concept-based multi-modal interpretable skin lesion diagnosis model. By incorporating medical texts, such as reports and conceptual terms, our model can recognize fine-grained features and provide robust, natural language-driven interpretability. Moreover, our concept-adaptive method reconstructs images from concept logits and imposes a consistency loss with the original image, enabling the VLM to adapt quickly to the task with a small amount of training data. Extensive experimental results demonstrate that our approach outperforms state-of-the-art black-box and interpretable models in both classification performance and medically relevant interpretability. In particular, after fine-tuning with a small amount of data, our model outperforms MONET, a model trained on the large Skin Disease Image-Report dataset, by 8.28% in concept recognition ability, demonstrating the interpretability of our model.
Code is available at https://github.com/zjmiaprojects/CptAFT.
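As a rough illustration of the concept-adaptive idea described in the abstract (concept logits, reconstruction from those logits, and a consistency loss against the original), a minimal sketch is given below. It is not the authors' implementation: it assumes CLIP-style cosine similarity for the concept logits, a plain linear decoder, a mean-squared-error consistency term, and it reconstructs a small feature vector instead of a full image. All function names and the `decoder` matrix are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def concept_logits(image_emb, concept_embs):
    """One logit per concept: similarity of the image embedding to each
    concept text embedding (assumed CLIP-style scoring)."""
    return [cosine(image_emb, c) for c in concept_embs]


def reconstruct(logits, decoder):
    """Hypothetical linear decoder: one row per output dimension,
    one column per concept."""
    return [sum(w * z for w, z in zip(row, logits)) for row in decoder]


def consistency_loss(original, reconstruction):
    """Mean squared error between the original and its reconstruction
    (assumed form of the consistency term)."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstruction)) / len(original)


# Toy example: two concepts, two-dimensional embeddings.
image_emb = [1.0, 0.0]
concept_embs = [[1.0, 0.0], [0.0, 1.0]]
decoder = [[0.5, 0.0], [0.0, 0.5]]

logits = concept_logits(image_emb, concept_embs)
loss = consistency_loss(image_emb, reconstruct(logits, decoder))
```

In this toy setup the loss shrinks as the decoder learns to recover the input from the concept logits alone, which is the sense in which a consistency term could push the concept logits to carry task-relevant information.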
Zhu et al. (Wed,) studied this question.