e13680
Background: Skin cancer is the most common malignancy worldwide, encompassing a spectrum from benign nevi to aggressive melanoma and nonmelanoma skin cancers. Dermoscopy improves diagnostic accuracy but remains operator dependent, with substantial interobserver variability, particularly for atypical or early lesions. Convolutional neural networks (CNNs) have demonstrated expert-level performance in skin lesion classification; however, their reliance on localized receptive fields may limit modeling of global lesion symmetry, border irregularity, and spatial color patterns central to dermoscopic diagnosis. Vision Transformers (ViTs) offer an alternative paradigm that uses self-attention to capture long-range contextual relationships across the entire image. We evaluated a ViT-B/16 model for automated diagnosis and classification of skin tumors from dermoscopic images.
Methods: We analyzed anonymized, publicly available dermoscopy datasets derived from established skin cancer imaging repositories, including the ISIC Archive. The datasets comprised malignant melanoma, basal cell carcinoma, squamous cell carcinoma, and benign melanocytic lesions, with ground truth established by histopathology or expert consensus. Images were standardized, augmented, and split into training and validation cohorts using stratified sampling. A ViT-B/16 model pretrained on ImageNet was fine-tuned for multi-class skin tumor classification. Input images (224×224) were partitioned into non-overlapping 16×16 patches and embedded into a token sequence with positional encodings and a learnable class token. The architecture employed 12 transformer encoder blocks with multi-head self-attention and feed-forward networks. Performance was evaluated using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC).
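The patch partitioning described in Methods can be illustrated with a minimal sketch. It is not the authors' code: the function names `patch_grid` and `split_into_patches` are hypothetical, the image is a plain nested list rather than a tensor, and the learned linear embedding and positional encodings are omitted. Under those assumptions, a 224×224 input with 16×16 patches yields a 14×14 grid, that is 196 patch tokens plus one class token, for a sequence length of 197.

```python
# Illustrative sketch only; all names here are hypothetical, not from the study.

def patch_grid(image_size=224, patch_size=16):
    """Return (patches_per_side, sequence_length) for a square image.

    sequence_length counts every patch token plus one class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side + 1


def split_into_patches(image, patch_size):
    """Split a square 2D list of pixel values into flattened,
    non-overlapping patches, ordered row by row across the grid."""
    size = len(image)
    if any(len(row) != size for row in image):
        raise ValueError("image must be square")
    if size % patch_size != 0:
        raise ValueError("image size must be divisible by patch_size")
    patches = []
    for top in range(0, size, patch_size):
        for left in range(0, size, patch_size):
            # Flatten one patch_size x patch_size window into a single list.
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    per_side, seq_len = patch_grid(224, 16)
    print(per_side, seq_len)

    # Tiny 4x4 single-channel example with 2x2 patches.
    toy = [[r * 4 + c for c in range(4)] for r in range(4)]
    print(split_into_patches(toy, 2))
```

In a real ViT each flattened patch (16 × 16 × 3 = 768 values for an RGB image) would then pass through a learned linear projection before the positional encodings are added; that step is left out here to keep the sketch dependency-free.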
Results: The Vision Transformer achieved strong diagnostic performance across skin tumor classes, with overall accuracy exceeding 96% and an AUROC greater than 0.96. Attention-based global modeling improved discrimination of lesions with irregular borders, heterogeneous pigmentation, and asymmetric morphology, all features critical for melanoma detection. Performance remained stable across lesion subtypes and imaging conditions, supporting robustness and generalizability comparable to state-of-the-art CNN approaches.
Conclusions: Vision Transformer–based modeling enables accurate and interpretable classification of skin tumors by capturing global dermoscopic context beyond localized feature extraction. Although computationally more intensive than conventional CNNs, ViT architectures offer complementary strengths in modeling complex lesion morphology and warrant further prospective evaluation for integration into clinical skin cancer screening and diagnostic workflows.
Elangovan et al. studied this question.