What question did this study set out to answer?

This study evaluates and compares the effectiveness of transfer learning models and Vision Transformer in classifying skin cancer images.

February 14, 2026 | Open Access

Comparative Analysis of Transfer Learning and Vision Transformer Models for Skin Cancer Classification Using Enhanced Dermoscopic Images

Key Points

Compared five transfer learning models: DenseNet169, InceptionV3, MobileNetV2, VGG16, and Xception.
Applied a Vision Transformer (ViT) model for classification.
Implemented image enhancement techniques such as grayscale conversion, thresholding, Canny edge detection, dilation, and erosion.
ViT achieved 93.79% recall, 92.22% precision, 93.00% F1-score, and 92.42% accuracy in initial tests.
Using enhanced images, ViT's metrics improved to 95.49% recall, 94.17% precision, 94.83% F1-score, and 94.39% accuracy.
InceptionV3 and MobileNetV2 showed strong recall but did not surpass the overall accuracy of ViT.
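The enhancement steps listed above can be illustrated with a minimal pure-Python sketch. It is not the authors' pipeline: the summary does not state the library, parameters, or order of operations, so the threshold value, the 3x3 neighbourhood, the border handling, and all function names below are assumptions chosen for illustration. Canny edge detection is left out because it is a multi-stage algorithm; in practice it would normally come from an image library (OpenCV, for example, exposes it as `cv2.Canny`).

```python
from typing import Callable, List, Sequence, Tuple

Gray = List[List[int]]                      # 2-D grid of 0..255 intensities
RGB = List[List[Tuple[int, int, int]]]      # 2-D grid of (r, g, b) pixels


def to_grayscale(rgb: RGB) -> Gray:
    """Convert RGB to grayscale using the common BT.601 luma weights."""
    return [[int(round(0.299 * r + 0.587 * g + 0.114 * b)) for (r, g, b) in row]
            for row in rgb]


def threshold(gray: Gray, t: int = 128) -> Gray:
    """Binary threshold: pixels >= t become 255, all others 0 (t is an assumed value)."""
    return [[255 if v >= t else 0 for v in row] for row in gray]


def _morph(img: Gray, pick: Callable[[Sequence[int]], int]) -> Gray:
    """Apply `pick` (max or min) over each pixel's 3x3 neighbourhood.

    Neighbours that fall outside the image are simply ignored.
    """
    h, w = len(img), len(img[0])
    out: Gray = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(pick(window))
        out.append(row)
    return out


def dilate(img: Gray) -> Gray:
    """Dilation: each pixel takes the maximum of its neighbourhood (grows bright regions)."""
    return _morph(img, max)


def erode(img: Gray) -> Gray:
    """Erosion: each pixel takes the minimum of its neighbourhood (shrinks bright regions)."""
    return _morph(img, min)


# Toy 5x5 binary mask with a single bright pixel in the centre.
dot = [[0] * 5 for _ in range(5)]
dot[2][2] = 255
grown = dilate(dot)        # the dot becomes a 3x3 block
restored = erode(grown)    # eroding the block shrinks it back to the dot
```

Dilation followed by erosion is a morphological closing, which tends to fill small gaps in a lesion mask while keeping its overall size.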

Abstract

In recent years, deep learning has achieved remarkable advancements in medical image analysis, particularly through Convolutional Neural Networks (CNNs) and Transformer-based architectures. This study aims to evaluate and compare the performance of five transfer learning models (DenseNet169, InceptionV3, MobileNetV2, VGG16, and Xception) and a Vision Transformer (ViT) model for the classification of skin cancer using the “Skin Cancer: Malignant vs. Benign” dataset. In the first phase, the ViT model achieved the highest overall performance, with 93.79% recall, 92.22% precision, 93.00% F1-score, and 92.42% accuracy. Although InceptionV3 and MobileNetV2 demonstrated strong recall values, they did not match the overall accuracy of ViT. In the second phase, image enhancement techniques (grayscale conversion, thresholding, Canny edge detection, dilation, and erosion) were applied to emphasize lesion boundaries and improve contrast. Using these enhanced images, the ViT model again achieved the best performance, with 95.49% recall, 94.17% precision, 94.83% F1-score, and 94.39% accuracy. These results indicate that the ViT architecture provides higher accuracy and reliability than the transfer learning models on both the original and the enhanced images. Furthermore, the study demonstrates that incorporating image preprocessing techniques can improve the performance of deep learning models in medical imaging applications: for ViT, accuracy rose from 92.42% to 94.39%.
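As a reading aid for the reported metrics, the sketch below shows how recall, precision, F1-score, and accuracy relate to one another. The function names and the confusion-matrix counts are hypothetical (the summary does not give the study's raw counts); only the percentage values passed to `f1_score` come from the text above.

```python
from typing import Dict


def f1_score(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def metrics_from_counts(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Derive the four metrics from binary confusion-matrix counts.

    tp/fp/fn/tn = true positives, false positives, false negatives, true negatives.
    """
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Consistency check on the reported ViT figures (percentages from the abstract).
f1_original = round(f1_score(92.22, 93.79), 2)   # phase 1, original images
f1_enhanced = round(f1_score(94.17, 95.49), 2)   # phase 2, enhanced images

# Hypothetical counts, for illustration only; not taken from the study.
example = metrics_from_counts(tp=90, fp=10, fn=5, tn=95)
```

Both computed F1 values agree with the reported 93.00% and 94.83% after rounding to two decimals, so the stated precision, recall, and F1 figures are mutually consistent.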
