Background/Objectives: To develop and externally evaluate a deep learning framework for multi-label thoracic disease classification on chest radiographs using hybrid convolutional neural network (CNN)–transformer architectures, hierarchical scalar-weighted fusion, and ensemble strategies.
Methods: This retrospective, multi-center study used publicly available datasets: NIH ChestX-ray14 (112,120 images; 30,805 patients) for model development and internal testing, and CheXpert (223,415 images) plus ChestX-Det10 (3578 images) for external validation. Nine CNN–transformer hybrids were systematically benchmarked, and the proposed model incorporated multi-scale DenseNet121 features, scalar-weighted fusion, positional encodings, and cross-attention. Four post hoc ensemble methods were explored, including a class-wise Top-3 Grid Search. Performance was evaluated using AUROC as the primary metric, along with precision, recall, F1-score, accuracy, specificity, positive predictive value, and negative predictive value. Statistical comparisons were performed using bootstrapped resampling and appropriate parametric or non-parametric tests.
Results: On the NIH internal test set, the proposed hybrid model achieved a mean AUROC of 0.8495, which was significantly higher than that of the DenseNet121 baseline (0.8441, p = 0.032). The Top-3 Grid Search ensemble further improved internal performance, achieving a mean AUROC of 0.8577 (p < 0.00001). On external validation, the ensemble consistently outperformed DenseNet121, achieving mean AUROCs of 0.6500 on CheXpert (p < 0.001) and 0.6592 on ChestX-Det10 (p < 0.001). Per-class analysis revealed significant improvements for clinically important conditions such as cardiomegaly, mass, and pneumothorax. Grad-CAM visualizations demonstrated strong alignment of predicted abnormalities with radiologically relevant regions.
Conclusions: This CNN–transformer framework, particularly when combined with class-wise ensemble strategies, provided modest but statistically significant improvements in multi-label chest X-ray classification. External validation suggested partial generalizability across datasets, although performance remained moderate under domain shift.
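For readers who want to see how a bootstrapped comparison of mean AUROC between two models can be set up, the following is a minimal sketch and not the authors' implementation. It assumes image-level resampling with replacement, a macro-averaged AUROC over classes, and a percentile-based two-sided p-value; the function names (`auroc`, `macro_auroc`, `paired_bootstrap`) and all parameters are hypothetical. The abstract does not state whether resampling was done at the image or the patient level, so that choice here is an assumption.

```python
import numpy as np


def auroc(y_true, y_score):
    """AUROC from the Mann-Whitney U statistic; tied scores get average ranks."""
    y_true = np.asarray(y_true).astype(bool)
    y_score = np.asarray(y_score, dtype=float)
    n_pos = int(y_true.sum())
    n_neg = y_true.size - n_pos
    if n_pos == 0 or n_neg == 0:
        return np.nan  # undefined when only one class is present
    # Average 1-based rank of each distinct score value.
    _, inverse, counts = np.unique(y_score, return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inverse]
    u_stat = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


def macro_auroc(labels, scores):
    """Mean AUROC over label columns, skipping classes where it is undefined."""
    per_class = [auroc(labels[:, k], scores[:, k]) for k in range(labels.shape[1])]
    return float(np.nanmean(per_class))


def paired_bootstrap(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Paired bootstrap of the macro-AUROC difference (model A minus model B).

    Assumption: rows (images) are resampled with replacement, and the same
    resampled rows are used for both models so the comparison stays paired.
    """
    rng = np.random.default_rng(seed)
    n = labels.shape[0]
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        diffs[b] = macro_auroc(labels[idx], scores_a[idx]) - macro_auroc(
            labels[idx], scores_b[idx]
        )
    observed = macro_auroc(labels, scores_a) - macro_auroc(labels, scores_b)
    ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])
    # Two-sided percentile p-value: how often the difference crosses zero.
    p_value = min(1.0, 2.0 * min(np.mean(diffs <= 0), np.mean(diffs >= 0)))
    return observed, (ci_low, ci_high), p_value
```

Here `labels` is an images-by-classes binary array and `scores_a`, `scores_b` are the two models' predicted probabilities on the same images. With several images per patient, as in NIH ChestX-ray14, resampling patients rather than images would be the more conservative choice.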
Hsieh et al. (Mon,) studied this question.