Lung cancer screening (LCS) from computed tomography (CT) is notoriously difficult because many nodules show only borderline visual patterns that overlap across multiple diagnostic categories. Most existing CAD systems rely solely on CNNs or standalone transformers and offer limited global–local feature synergy, interpretability, and multi-class stratification within a distributed framework. To address these limitations, we present LungDxFormer, a hybrid CNN–Transformer model with a Dynamic Spatial Attention (DSA) mechanism that focuses on clinically relevant regions and supports interpretable presentation of decisions. The framework directly classifies lung nodules into three classes: benign, indeterminate, and malignant. Using patient-wise cross-validation on the public LIDC-IDRI dataset, our method achieves 97.35% overall accuracy with high precision, recall, and AUC across all three classes, including the clinically challenging indeterminate class. Grad-CAM visualisations further explain the model by highlighting diagnostically relevant regions, consistent with clinicians’ expectations. These results support LungDxFormer as a novel, interpretable, and robust approach to CT-based lung nodule classification.
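Patient-wise cross-validation keeps every nodule from a given patient in the same fold, so that no patient contributes to both training and testing. The following is a minimal sketch of that splitting idea only, not the authors' implementation: the function names, the patient identifiers, the fold count, and the seed are all hypothetical and chosen for illustration.

```python
import random
from collections import defaultdict


def patient_wise_folds(patient_ids, n_folds=5, seed=0):
    """Group sample indices into folds so that no patient spans two folds.

    patient_ids[i] is the (hypothetical) patient identifier of sample i.
    """
    unique_patients = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_patients)

    # Assign whole patients to folds in round-robin order.
    patient_to_fold = {pid: i % n_folds for i, pid in enumerate(unique_patients)}

    folds = defaultdict(list)
    for sample_index, pid in enumerate(patient_ids):
        folds[patient_to_fold[pid]].append(sample_index)
    return [folds[k] for k in range(n_folds)]


def train_test_indices(folds, test_fold):
    """Return (train, test) sample indices for one cross-validation round."""
    test = list(folds[test_fold])
    train = [i for k, fold in enumerate(folds) if k != test_fold for i in fold]
    return train, test


# Hypothetical nodule-level data: several nodules may share one patient.
patient_ids = ["P01", "P01", "P02", "P03", "P03", "P03", "P04", "P05", "P05", "P06"]
folds = patient_wise_folds(patient_ids, n_folds=3, seed=42)

for k in range(len(folds)):
    train, test = train_test_indices(folds, k)
    train_patients = {patient_ids[i] for i in train}
    test_patients = {patient_ids[i] for i in test}
    # No patient may appear on both sides of the split.
    assert train_patients.isdisjoint(test_patients)
```

Splitting at the patient level rather than the nodule level avoids leakage between correlated nodules from the same scan. This sketch does not balance the three classes across folds; a stratified grouped split would be needed for that.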
Rao et al. (Sat,) studied this question.