e20014

Background: Lung cancer remains the leading cause of cancer-related mortality worldwide, driven largely by delayed diagnosis and variability in interpretation of computed tomography (CT) imaging. Although convolutional neural networks (CNNs) have shown strong performance in pulmonary lesion analysis, their reliance on localized receptive fields may limit modeling of global anatomic context, particularly for heterogeneous tumors and lesions with diffuse margins. Vision Transformers (ViTs) introduce an attention-based paradigm that enables global contextual reasoning across entire images. We evaluated a Vision Transformer B/16 (ViT-B/16) model for lung cancer classification on CT imaging.

Methods: A publicly available lung CT dataset comprising 5,000 images was analyzed, including malignant lung cancer, benign pulmonary lesions, and normal lung findings. Images were stratified into training (80%) and validation (20%) cohorts with class balancing applied to the training set. A ViT-B/16 model pretrained on ImageNet was fine-tuned for multi-class classification. Input images (224×224) were partitioned into non-overlapping 16×16 patches and embedded into a token sequence with positional encodings and a learnable class token. The architecture employed 12 transformer encoder blocks with multi-head self-attention and feed-forward layers. Performance was assessed using accuracy, sensitivity, specificity, precision, F1 score, confusion matrix analysis, and area under the receiver operating characteristic curve (AUROC).

Results: The Vision Transformer achieved a validation accuracy of 97% with balanced class-wise performance. Sensitivity for malignant lung cancer exceeded 96%, with specificity of 98% for benign and normal findings. Attention-based global modeling reduced misclassification of lesions with complex spatial patterns and diffuse margins, a known limitation of purely convolutional architectures. Performance was comparable to high-performing CNN models, although achieved with higher parameter count and computational cost.

Conclusions: Vision Transformer–based modeling enables accurate lung cancer classification on CT by leveraging attention-driven global contextual understanding beyond localized feature extraction. While less computationally efficient than optimized CNNs, ViTs offer complementary strengths in modeling complex spatial relationships and provide interpretable attention mechanisms. These findings support further investigation of transformer architectures in lung cancer screening, diagnostic triage, and AI-assisted radiologic workflows.
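As an illustration of two steps named in the Methods, the sketch below shows how a 224×224 input splits into non-overlapping 16×16 patches, and how per-class sensitivity and specificity can be read off a confusion matrix in a one-vs-rest fashion. This is not the authors' code. The function names `patchify` and `one_vs_rest`, the single-channel dummy image, and the confusion-matrix counts are all hypothetical and chosen only for the example.

```python
# Illustrative sketch only; names and data are assumptions, not from the study.

def patchify(image, patch=16):
    """Split an H x W image (a list of rows) into non-overlapping
    patch x patch tiles, each flattened row-major into one list."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[r][c]
                    for r in range(top, top + patch)
                    for c in range(left, left + patch)]
            patches.append(tile)
    return patches


def one_vs_rest(confusion, k):
    """Return (sensitivity, specificity) for class k, where
    confusion[i][j] counts samples of true class i predicted as class j."""
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k][j] for j in range(n) if j != k)
    fp = sum(confusion[i][k] for i in range(n) if i != k)
    tn = sum(confusion[i][j]
             for i in range(n) for j in range(n)
             if i != k and j != k)
    return tp / (tp + fn), tn / (tn + fp)


# Dummy single-channel 224 x 224 image (hypothetical data).
image = [[0] * 224 for _ in range(224)]
patches = patchify(image)
num_tokens = len(patches) + 1  # patch tokens plus one learnable class token

# Hypothetical counts; rows are true classes, columns are predictions,
# in the order malignant, benign, normal. These are NOT the study's results.
confusion = [
    [45, 3, 2],
    [4, 44, 2],
    [1, 2, 47],
]
sensitivity, specificity = one_vs_rest(confusion, 0)
```

With these inputs the image yields a 14×14 grid of 196 patches, so the encoder would see a sequence of 197 tokens once the class token is prepended. The one-vs-rest reduction treats the malignant class as positive and pools the benign and normal classes as negative, which is one common way to report sensitivity and specificity for a multi-class problem. The abstract does not say which convention the authors used.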
Sarveswar Chinnaswamy Dhandapani
Tanzeela Shuja
Elangovan Krishnan
Journal of Clinical Oncology
Marshfield Clinic
Saveetha University
Marshfield Clinic
Dhandapani et al. studied this question.
synapsesocial.com/papers/6a1a82d50307b78509434899 — DOI: https://doi.org/10.1200/jco.2026.44.16_suppl.e20014