What question did this study set out to answer?

This study assesses how effectively Vision Transformers (ViTs) classify lung cancer on CT images and whether they address limitations of convolutional neural networks (CNNs).

May 30, 2026

From local pixels to global patterns: Vision transformer–based modeling for lung cancer classification on computed tomography.

Key Points

The study analyzed a publicly available lung CT dataset of 5,000 images covering malignant lung cancer, benign pulmonary lesions, and normal lung findings.
Images were stratified into training (80%) and validation (20%) cohorts, with class balancing applied to the training set.
A Vision Transformer B/16 (ViT-B/16) model pretrained on ImageNet was fine-tuned for multi-class classification, using attention-based global contextual reasoning.
Validation accuracy reached 97%, with sensitivity for malignant lung cancer exceeding 96% and specificity of 98% for benign and normal findings.
Attention-based global modeling reduced misclassification of lesions with complex spatial patterns and diffuse margins, a known limitation of purely convolutional architectures.
Performance was comparable to that of high-performing CNN models, but at a higher parameter count and computational cost.
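
The sensitivity and specificity figures above are one-vs-rest quantities for a three-class problem. As a minimal sketch of how such values can be derived from a confusion matrix, the Python below uses a made-up matrix. The counts, the class order, and the function name `one_vs_rest_rates` are illustrative assumptions and are not taken from the study.

```python
from typing import List, Tuple


def one_vs_rest_rates(cm: List[List[int]], k: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for class k, treated one-vs-rest.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]                          # class k correctly predicted as k
    fn = sum(cm[k]) - tp                   # class k predicted as something else
    fp = sum(row[k] for row in cm) - tp    # other classes predicted as k
    tn = total - tp - fn - fp              # everything else
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical counts, NOT the study's results.
# Assumed class order: malignant, benign, normal.
example_cm = [
    [90, 6, 4],
    [5, 92, 3],
    [2, 3, 95],
]

sens, spec = one_vs_rest_rates(example_cm, 0)
print(f"malignant: sensitivity={sens:.3f}, specificity={spec:.3f}")
```

The function assumes every class appears at least once as a true label and at least once among the remaining classes; otherwise the divisions would need a guard.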

Abstract

e20014

Background: Lung cancer remains the leading cause of cancer-related mortality worldwide, driven largely by delayed diagnosis and variability in interpretation of computed tomography (CT) imaging. Although convolutional neural networks (CNNs) have shown strong performance in pulmonary lesion analysis, their reliance on localized receptive fields may limit modeling of global anatomic context, particularly for heterogeneous tumors and lesions with diffuse margins. Vision Transformers (ViTs) introduce an attention-based paradigm that enables global contextual reasoning across entire images. We evaluated a Vision Transformer B/16 (ViT-B/16) model for lung cancer classification on CT imaging.

Methods: A publicly available lung CT dataset comprising 5,000 images was analyzed, including malignant lung cancer, benign pulmonary lesions, and normal lung findings. Images were stratified into training (80%) and validation (20%) cohorts with class balancing applied to the training set. A ViT-B/16 model pretrained on ImageNet was fine-tuned for multi-class classification. Input images (224×224) were partitioned into non-overlapping 16×16 patches and embedded into a token sequence with positional encodings and a learnable class token. The architecture employed 12 transformer encoder blocks with multi-head self-attention and feed-forward layers. Performance was assessed using accuracy, sensitivity, specificity, precision, F1 score, confusion matrix analysis, and area under the receiver operating characteristic curve (AUROC).

Results: The Vision Transformer achieved a validation accuracy of 97% with balanced class-wise performance. Sensitivity for malignant lung cancer exceeded 96%, with specificity of 98% for benign and normal findings. Attention-based global modeling reduced misclassification of lesions with complex spatial patterns and diffuse margins, a known limitation of purely convolutional architectures. Performance was comparable to high-performing CNN models, although achieved with higher parameter count and computational cost.

Conclusions: Vision Transformer–based modeling enables accurate lung cancer classification on CT by leveraging attention-driven global contextual understanding beyond localized feature extraction. While less computationally efficient than optimized CNNs, ViTs offer complementary strengths in modeling complex spatial relationships and provide interpretable attention mechanisms. These findings support further investigation of transformer architectures in lung cancer screening, diagnostic triage, and AI-assisted radiologic workflows.
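
The patch arithmetic described in the Methods can be checked directly. The following is a minimal sketch, not the authors' code: the helper name `vit_token_count` is hypothetical, and the three-channel input is an assumption, since the abstract does not say how CT slices were mapped to the RGB input that an ImageNet-pretrained model expects.

```python
def vit_token_count(image_size: int, patch_size: int, class_token: bool = True) -> int:
    """Sequence length for a square image cut into non-overlapping square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size
    n_patches = patches_per_side * patches_per_side
    # The learnable class token adds one extra position to the sequence.
    return n_patches + (1 if class_token else 0)


tokens = vit_token_count(224, 16)                         # 14 * 14 + 1
patch_tokens = vit_token_count(224, 16, class_token=False)

# Values per flattened patch before the linear embedding.
# ASSUMPTION: 3 input channels (not stated in the abstract).
assumed_channels = 3
values_per_patch = 16 * 16 * assumed_channels

print(tokens, patch_tokens, values_per_patch)  # 197 196 768
```

For a 224×224 input with 16×16 patches this gives 14 × 14 = 196 patch tokens, or 197 positions once the class token is added, which is the sequence each of the 12 encoder blocks attends over.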

Authors

Sarveswar Chinnaswamy Dhandapani

Tanzeela Shuja

Elangovan Krishnan

Journals

Journal of Clinical Oncology

Institutions

Marshfield Clinic

Saveetha University

Marshfield Clinic
