September 10, 2025

Enhancing the Interpretation of Skin Lesion Diagnosis: Concept Adaptive Fine-Tuning of Vision-Language Models

Key Points

Our method improves the classification performance of skin lesions, enhancing interpretability with natural language.
Fine-tuning the vision-language model with a small dataset improved concept recognition ability by 8.28% over MONET, a model trained on the large Skin Disease Image-Report dataset.
The approach incorporates medical texts to better recognize features of skin lesions, making models more explainable.
The concept adaptive tuning allows quick adaptation of VLMs to specialized medical tasks, reducing data requirements.

Abstract

Significant progress has been made in applying deep learning to the automatic diagnosis of skin lesions. However, most models remain unexplainable, which severely hinders their application in clinical settings. Concept-based ante-hoc interpretable models have the potential to clarify the diagnostic decision-making process by learning high-level, human-understandable concepts, but they can only provide numerical values of conceptual contributions. Pre-trained Vision-Language Models (VLMs) can learn rich vision-language correlations from large-scale image-text pairs. Fine-tuning pre-trained VLMs for specific downstream tasks is an effective way to reduce data requirements. Nevertheless, when there is a substantial disparity between the pre-trained model and the target task, existing tuning methods frequently struggle to generalize and require substantial training data to fully adapt VLMs to specialized medical tasks. In this work, we propose a concept adaptive fine-tuning (CptAFT) method based on the pre-trained VLM BiomedCLIP to develop a concept-based, multi-modal, interpretable skin lesion diagnosis model. By incorporating medical texts, such as reports and conceptual terms, our model can recognize fine-grained features and provide robust, natural language-driven interpretability. Moreover, our concept-adaptive method reconstructs images using concept logits and imposes a consistency loss with the original image, enabling the VLM to quickly adapt to the task with a small amount of training data. Extensive experimental results demonstrate that our approach outperforms state-of-the-art black-box and interpretable models in both classification performance and medically relevant interpretability. In particular, after fine-tuning with a small amount of data, our model outperforms MONET, a model trained on the large Skin Disease Image-Report dataset, by 8.28% in concept recognition ability, demonstrating the interpretability of our model.
Code is available at https://github.com/zjmiaprojects/CptAFT.
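
The abstract describes reconstructing the image from concept logits and penalizing inconsistency with the original. The toy sketch below illustrates that general idea only; it is not the authors' implementation. It assumes CLIP-style cosine-similarity logits, a linear decoder, reconstruction in embedding space rather than pixel space, and a mean-squared-error consistency loss. All function names and the example vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def concept_logits(image_emb, concept_embs):
    """Cosine similarity between the image embedding and each concept text embedding.
    (Assumption: CLIP-style similarity scoring.)"""
    img = l2_normalize(image_emb)
    return [sum(a * b for a, b in zip(img, l2_normalize(c))) for c in concept_embs]


def reconstruct(logits, decoder):
    """Map concept logits back to embedding space with a linear decoder.
    (Assumption: decoder has one row per concept, each of embedding size.)"""
    dim = len(decoder[0])
    return [sum(l * row[j] for l, row in zip(logits, decoder)) for j in range(dim)]


def consistency_loss(original, reconstructed):
    """Mean squared error between original and reconstructed embeddings.
    (Assumption: the paper's actual loss may differ.)"""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)


# Hypothetical 3-D embeddings: two concepts spanning the first two axes.
concepts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
decoder = concepts  # toy decoder that reuses the concept directions

explained = [0.6, 0.8, 0.0]    # lies entirely in the concept span
unexplained = [0.6, 0.0, 0.8]  # has a component no concept captures

loss_explained = consistency_loss(
    explained, reconstruct(concept_logits(explained, concepts), decoder))
loss_unexplained = consistency_loss(
    unexplained, reconstruct(concept_logits(unexplained, concepts), decoder))

print(loss_explained, loss_unexplained)
```

In this toy setting the loss is close to zero when the concepts fully account for the embedding and positive when they do not. During fine-tuning, a loss of this kind would be minimized by gradient descent, which the sketch does not perform.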

Cite This Study

Zhu et al. studied this question.

synapsesocial.com/papers/68c1955c9b7b07f3a061947e https://doi.org/10.1109/jbhi.2025.3606881
