e21532

Background: Melanoma is an aggressive skin cancer with rapidly rising incidence and disproportionately high mortality when diagnosed at advanced stages. Dermoscopy improves diagnostic accuracy but remains highly operator dependent, with substantial interobserver variability, particularly for early or atypical lesions. Although convolutional neural networks (CNNs) have demonstrated strong performance in automated skin lesion analysis, their reliance on localized receptive fields may limit modeling of the global lesion characteristics that are central to melanoma diagnosis, such as asymmetry, border irregularity, and spatial color heterogeneity. Vision Transformers (ViTs) offer an alternative paradigm based on self-attention, enabling global contextual reasoning across the entire image. We evaluated a Vision Transformer B/16 (ViT-B/16) model for automated melanoma diagnosis using dermoscopic imaging.

Methods: We analyzed publicly available dermoscopy datasets from established melanoma imaging repositories, including the ISIC Archive and HAM10000 collections, comprising malignant melanoma, non-melanoma skin cancers, and benign melanocytic lesions with histopathologic confirmation or expert consensus annotation. Images were standardized, augmented, and stratified into training and validation cohorts. A ViT-B/16 model pretrained on ImageNet was fine-tuned for multi-class lesion classification. Each input image (224×224) was partitioned into 196 non-overlapping 16×16 patches, which were embedded into a token sequence with positional encodings and a learnable class token. The architecture employed 12 transformer encoder blocks with multi-head self-attention and feed-forward layers. Performance was assessed using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC).

Results: The Vision Transformer achieved strong diagnostic performance, with overall accuracy exceeding 97% and AUROC greater than 0.97 across lesion categories. Attention-based global modeling improved discrimination of melanomas with asymmetric structure, heterogeneous pigmentation, and irregular borders, features commonly associated with diagnostic uncertainty. Performance remained stable across lesion subtypes and imaging conditions and was comparable to state-of-the-art CNN-based approaches.

Conclusions: Vision Transformer–based modeling enables accurate and interpretable melanoma classification by capturing global dermoscopic context beyond localized feature extraction. Although computationally more intensive than conventional CNNs, ViT architectures offer complementary strengths for modeling complex lesion morphology and warrant further evaluation for integration into melanoma screening and AI-assisted dermatologic workflows.
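The tokenization step described in the Methods can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the random weights stand in for learned parameters, the 768-dimensional embedding is the standard ViT-B width and is not stated in the abstract, and the function names `patchify` and `embed_tokens` are hypothetical.

```python
import numpy as np

IMAGE_SIZE = 224   # input resolution given in the abstract
PATCH_SIZE = 16    # patch size given in the abstract
EMBED_DIM = 768    # assumption: standard ViT-B hidden size, not stated in the abstract


def patchify(image, patch_size=PATCH_SIZE):
    """Split an (H, W, C) image into flattened, non-overlapping square patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # (gh, p, gw, p, c) -> (gh, gw, p, p, c) -> (gh * gw, p * p * c)
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def embed_tokens(image, rng):
    """Build the encoder input: class token + projected patches + positions.

    All weights are random placeholders for parameters that would be learned.
    """
    patches = patchify(image)
    projection = rng.normal(scale=0.02, size=(patches.shape[1], EMBED_DIM))
    class_token = rng.normal(scale=0.02, size=(1, EMBED_DIM))
    positions = rng.normal(scale=0.02, size=(patches.shape[0] + 1, EMBED_DIM))
    tokens = np.concatenate([class_token, patches @ projection], axis=0)
    return tokens + positions


rng = np.random.default_rng(0)
image = rng.random((IMAGE_SIZE, IMAGE_SIZE, 3))  # synthetic stand-in for a dermoscopic image
tokens = embed_tokens(image, rng)
print(patchify(image).shape, tokens.shape)
```

A 224×224 image yields a 14×14 grid of patches, so the encoder receives 196 patch tokens plus the class token. Self-attention then relates every patch to every other patch within a single layer, which is the global-context property the abstract contrasts with localized CNN receptive fields.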
Elangovan et al. (Thu,) studied this question.
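For readers who want to see how the reported metrics are defined in a multi-class setting, the following is a small sketch using a one-vs-rest treatment of a single class. The abstract does not say how metrics were aggregated across classes, so the one-vs-rest choice is an assumption, and the labels and scores are made-up toy values, not data from the study.

```python
import numpy as np


def one_vs_rest_metrics(y_true, y_pred, positive):
    """Sensitivity, specificity, and F1 for one class against all others.

    Assumes the positive class occurs in both y_true and y_pred,
    so none of the denominators are zero.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == positive) & (y_pred == positive)))
    fn = int(np.sum((y_true == positive) & (y_pred != positive)))
    fp = int(np.sum((y_true != positive) & (y_pred == positive)))
    tn = int(np.sum((y_true != positive) & (y_pred != positive)))
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


def auroc(is_positive, scores):
    """AUROC as the probability that a positive case outscores a negative one.

    Ties count as one half. Assumes at least one positive and one negative.
    """
    is_positive = np.asarray(is_positive, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    diff = scores[is_positive][:, None] - scores[~is_positive][None, :]
    return float((np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size)


# Toy labels and melanoma scores, invented purely for illustration.
y_true = ["mel", "mel", "mel", "nv", "nv", "nv", "bcc", "bcc"]
y_pred = ["mel", "mel", "nv", "nv", "nv", "mel", "bcc", "bcc"]
mel_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.1, 0.1]

accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
mel_metrics = one_vs_rest_metrics(y_true, y_pred, positive="mel")
mel_auroc = auroc(np.asarray(y_true) == "mel", mel_scores)
print(accuracy, mel_metrics, mel_auroc)
```

Sensitivity and specificity are reported per class here because overall accuracy alone can hide missed melanomas when benign lesions dominate the dataset.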