Stroke is a major cause of mortality and long-term disability worldwide, and rapid diagnosis is critical for timely treatment. Computed tomography (CT) imaging is widely used for initial stroke assessment, yet manual interpretation can be time-consuming and dependent on specialist availability. This study proposes a hybrid deep learning framework that combines a convolutional neural network (CNN) and a vision transformer (ViT) for automated stroke classification from brain CT images. The CNN captures localized spatial features while the ViT models global contextual relationships, and their feature representations are fused using a lightweight feed-forward network. The model was evaluated on a publicly available brain CT dataset using a fixed train–validation–test split. The proposed ensemble achieved an accuracy of 99.77%, precision of 99.24%, recall of 100.00%, F1-score of 99.62%, and an AUC of 0.9999. Confusion matrix analysis showed zero false negatives and one false positive in the test set. Training curves demonstrated stable convergence, and interpretability methods (LIME and occlusion sensitivity) highlighted image regions influencing predictions. Although the results indicate strong performance, the dataset represents a controlled and curated environment and does not fully capture the variability of real clinical imaging. Therefore, the reported accuracy should be interpreted as benchmark performance rather than definitive clinical diagnostic capability. Future work will involve multi-centre validation using hospital-acquired data and expert clinical evaluation. The findings suggest that CNN–ViT feature fusion is a promising approach for computer-aided stroke screening and may support radiologists in prioritizing suspicious cases after appropriate clinical validation.
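The reported metrics can be cross-checked against the confusion-matrix statement (zero false negatives, one false positive). The sketch below is illustrative only: the abstract does not give the test-set size or the per-class counts, so the values `tp=131` and `tn=303` are assumptions chosen because they reproduce the reported percentages. They are not figures taken from the study, and the function name `classification_metrics` is hypothetical.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against empty denominators so the function is safe on degenerate inputs.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# ASSUMED counts (not stated in the abstract): fp=1 and fn=0 come from the
# text, while tp and tn are hypothetical values consistent with the metrics.
metrics = classification_metrics(tp=131, fp=1, tn=303, fn=0)

for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Under these assumed counts, a single false positive in a test set of a few hundred images is enough to account for the gap between 100.00% recall and 99.24% precision. This illustrates why the abstract frames the result as benchmark performance: on a test set of this scale, each misclassified image shifts the headline metrics by a fraction of a percentage point.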
Turyamuhaki et al. (Thu,) studied this question.