What question did this study set out to answer?

April 12, 2026

Open Access

Benchmarking MedViT and hybrid CNN–ViT architectures for multi-label thoracic disease classification

Key Points

To evaluate the performance of MedViT and Hybrid CNN–ViT architectures for multi-label thoracic disease classification.
Adapted MedViT and Hybrid CNN–ViT architectures for CXR image classification
Utilized transfer learning and domain-specific augmentations
Trained models on NIH ChestX-ray14 and CheXpert datasets
Compared model performance against state-of-the-art methods
MedViT achieved 93.34% accuracy and 94.17% macro AUROC on NIH ChestX-ray14
Hybrid CNN–ViT reached 85.81% accuracy and 72.28% macro AUROC on NIH ChestX-ray14
MedViT had 79.22% accuracy and 75.11% macro AUROC on CheXpert
Robust precision and recall were observed for under-represented conditions like fibrosis and hernia

Abstract

Computer-aided diagnosis relies heavily on the automatic classification of thoracic diseases from chest X-ray (CXR) images, yet this task remains challenging due to class imbalance, overlapping radiological features, and high inter-class similarity. In this study, two scalable Vision Transformer (ViT)-based architectures, MedViT and Hybrid CNN–ViT, are adapted and evaluated for multi-label thoracic disease classification. MedViT is enhanced with transfer learning, domain-specific augmentations, and self-attention mechanisms to capture subtle pathological patterns across diverse conditions. The Hybrid CNN–ViT combines the strengths of CNNs and ViTs and is well suited to capturing local patterns. Both models are trained and validated on two benchmark datasets, NIH ChestX-ray14 and CheXpert, and compared against state-of-the-art baselines. On the NIH ChestX-ray14 dataset, MedViT achieved 93.34% accuracy and a macro AUROC of 94.17%, while the Hybrid CNN–ViT model reached 85.81% accuracy and a macro AUROC of 72.28%. On the CheXpert dataset, MedViT achieved 79.22% accuracy and a macro AUROC of 75.11%, whereas Hybrid CNN–ViT achieved 76.15% accuracy and a macro AUROC of 71.68%. These results show that MedViT performs well and generalizes effectively across different datasets. Per-label analysis demonstrated robust precision and recall even for under-represented conditions such as fibrosis and hernia, where existing models typically show significant performance drops. Unlike earlier methods that often struggle with generalization, MedViT maintains a balanced trade-off between sensitivity and specificity across all categories. These findings highlight the effectiveness of Transformer-based feature encoding in capturing subtle spatial correlations in medical imaging, while also setting new benchmarks for automated thoracic disease classification. The MedViT model outperformed the state-of-the-art methods and shows strong potential to support radiologists in decision-making and improve diagnostic workflows in clinical practice.
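
The headline metric in the abstract is macro AUROC, so a short illustration of how that number is typically formed for a multi-label task may help. The sketch below is not the authors' evaluation code: the function names `binary_auroc` and `macro_auroc` and the toy labels and scores are assumptions made for illustration only. It computes AUROC for each label separately, as the fraction of positive/negative pairs ranked correctly, and then takes the unweighted mean across labels.

```python
from typing import List, Sequence, Tuple


def binary_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC for one label: the probability that a random positive
    example is scored above a random negative one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC is undefined without both classes present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def macro_auroc(
    y_true: Sequence[Sequence[int]], y_score: Sequence[Sequence[float]]
) -> Tuple[float, List[float]]:
    """Unweighted mean of per-label AUROC over all label columns."""
    n_labels = len(y_true[0])
    per_label = [
        binary_auroc([row[j] for row in y_true], [row[j] for row in y_score])
        for j in range(n_labels)
    ]
    return sum(per_label) / n_labels, per_label


# Toy data (hypothetical): 4 images, 3 disease labels, multi-label targets.
y_true = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_score = [[0.9, 0.2, 0.1], [0.3, 0.8, 0.4], [0.6, 0.4, 0.2], [0.7, 0.1, 0.3]]

macro, per_label = macro_auroc(y_true, y_score)
print(per_label, macro)
```

Because every label contributes equally to the mean, a rare label counts as much as a common one, which is why macro averaging is often reported when classes are imbalanced.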

Authors

Victor Mawutor Agbo

Marwadi University

Ruchi Patel

Rutgers, The State University of New Jersey

Munindra Lunagaria

Marwadi University

Journals

Scientific Reports

Institutions

Barkatullah University

Manipal University Jaipur

Marwadi University
