e14003 Background: Brain tumors represent a heterogeneous group of neoplasms with widely variable prognosis and treatment strategies. Magnetic resonance imaging (MRI) is the cornerstone of brain tumor diagnosis and classification, yet interpretation is challenged by tumor heterogeneity, infiltrative growth patterns, and overlapping radiologic features across tumor subtypes. Convolutional neural networks (CNNs) have demonstrated strong performance in automated MRI analysis but rely primarily on localized receptive fields, which may limit modeling of the long-range spatial relationships essential for characterizing tumor extent, edema, and mass effect. Vision Transformers (ViTs) introduce a fundamentally different paradigm by leveraging self-attention to capture global contextual dependencies across entire images. We evaluated a ViT-B/16 model for automated diagnosis and classification of brain tumors on MRI.
Methods: We analyzed publicly available, curated brain MRI datasets comprising gliomas and non-glioma brain tumors with expert annotation and histopathologic correlation. Multisequence MR images were standardized, augmented, and split into training and validation cohorts using stratified sampling. A Vision Transformer B/16 model pretrained on ImageNet was fine-tuned for multi-class brain tumor classification. Input images (224×224) were partitioned into non-overlapping 16×16 patches and embedded into a token sequence augmented with positional encodings and a learnable class token. The architecture employed 12 transformer encoder blocks with multi-head self-attention and feed-forward layers, enabling global contextual modeling across tumor and peritumoral regions. Model performance was assessed using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC).
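The patch tokenization described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the helper names `patchify` and `sequence_length` are hypothetical, the image is a synthetic single-channel array (real inputs to an ImageNet-pretrained ViT would typically have three channels), and the learned linear projection and positional encodings are omitted. It shows only the arithmetic: a 224×224 input cut into non-overlapping 16×16 patches gives 14 × 14 = 196 patch tokens, or 197 with the class token.

```python
def patchify(image, patch=16):
    """Split an H x W image (list of rows) into non-overlapping
    patch x patch tiles, each flattened row-major into one token."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image sides must be multiples of the patch size")
    tokens = []
    for top in range(0, h, patch):          # patch rows, top to bottom
        for left in range(0, w, patch):     # patch columns, left to right
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


def sequence_length(side=224, patch=16, class_token=True):
    """Number of tokens the encoder sees for a square input."""
    n_patches = (side // patch) ** 2
    return n_patches + (1 if class_token else 0)


# Synthetic single-channel image (assumption: stands in for one MRI slice).
image = [[(r * 224 + c) % 256 for c in range(224)] for r in range(224)]
tokens = patchify(image)

print(len(tokens))        # 196 patch tokens
print(len(tokens[0]))     # 256 pixel values per single-channel patch
print(sequence_length())  # 197 tokens including the class token
```

Because self-attention compares every token with every other token, each encoder block in this setting works on a 197 × 197 attention pattern per head, which is how a patch in the tumor core can directly attend to a distant patch of edema.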
Results: The Vision Transformer achieved robust diagnostic performance across brain tumor classes, with overall accuracy exceeding 90% and AUROC greater than 0.90. Attention-based global modeling improved discrimination of infiltrative tumors and lesions with heterogeneous signal characteristics, reducing misclassification commonly observed with convolutional approaches. Performance remained stable across MRI sequences and tumor morphologies, supporting generalizability.
Conclusions: Vision Transformer–based modeling enables accurate and interpretable diagnosis and classification of brain tumors by capturing long-range spatial context beyond localized feature extraction. Although computationally more intensive than CNNs, ViT architectures offer complementary strengths for complex neuro-oncology imaging tasks and warrant further prospective evaluation to support clinical decision-making and treatment planning.
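For readers who want to see how the reported metric types relate to one another in a multi-class setting, the following is a minimal sketch that computes accuracy and one-vs-rest sensitivity, specificity, and F1 score from predicted labels. The function names, class names, and labels are hypothetical toy values chosen for illustration; they are not data or results from this study. AUROC is left out because it requires per-class scores rather than hard labels.

```python
def accuracy(y_true, y_pred):
    """Fraction of cases whose predicted class matches the reference class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """One-vs-rest sensitivity, specificity, and F1 score for each class."""
    results = {}
    for k in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == k and p == k for t, p in pairs)
        fn = sum(t == k and p != k for t, p in pairs)
        fp = sum(t != k and p == k for t, p in pairs)
        tn = sum(t != k and p != k for t, p in pairs)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        denom = precision + sensitivity
        f1 = 2 * precision * sensitivity / denom if denom else 0.0
        results[k] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "f1": f1,
        }
    return results


# Toy labels only (hypothetical class names, not study data).
labels = ["glioma", "non_glioma_a", "non_glioma_b"]
y_true = ["glioma"] * 4 + ["non_glioma_a"] * 3 + ["non_glioma_b"] * 3
y_pred = ["glioma", "glioma", "glioma", "non_glioma_a",
          "non_glioma_a", "non_glioma_a", "non_glioma_b",
          "non_glioma_b", "non_glioma_b", "glioma"]

overall = accuracy(y_true, y_pred)
metrics = per_class_metrics(y_true, y_pred, labels)
print(overall)
print(metrics["glioma"])
```

Reporting the per-class values alongside overall accuracy matters when classes are imbalanced, since a high overall accuracy can coexist with low sensitivity for a rare tumor type.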
Elangovan et al. (Thu,) studied this question.