Cervical cancer remains a critical global health challenge, particularly in low-resource settings, and early detection depends on robust computer-aided diagnostic (CAD) systems. This study introduces a novel deep learning architecture that fuses a self-attention-enhanced dense convolutional neural network (CNN) with a modified vision transformer (ViT) to improve cervical cancer classification accuracy. The dense CNN captures local features such as cell nuclei morphology, whereas the modified ViT extracts contextual features, so the fused model represents both fine-grained and global details. The framework applies image enhancement (histogram equalization and top/bottom hat filtering) to improve data quality, and data augmentation to address dataset scarcity and class imbalance. Evaluated on the Herlev (917 cell images) and Intel & Mobile-ODT (8,215 colposcopy images) datasets, the fused model achieves state-of-the-art accuracies of 96% (Herlev) and 92% (Intel & Mobile-ODT), outperforming standalone CNNs (up to 72%), ViTs (up to 91%), and prior methods (up to 91.94%). Fusing the two models via depth concatenation and classifying the fused features with downstream classifiers (e.g., Fine Gaussian SVM) yields superior generalization, with AUCs of 0.991 and 0.981 on the respective datasets. This work underscores the potential of hybrid architectures in bridging the inductive biases of CNNs and transformers for medical image analysis. Unlike previous research that relied solely on CNNs or basic transformers, it combines a dense CNN with self-attention and a modified residual-enriched vision transformer, addressing the limitations of single-architecture feature extraction and improving generalization across cytology and colposcopy images.
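The fusion step described above can be illustrated with a minimal sketch, assuming that each branch has already produced one feature vector per image. The function names (`l2_normalize`, `fuse_by_depth_concatenation`) and the per-branch L2 normalization are illustrative assumptions and are not taken from the study; the abstract states only that the two models are fused by depth concatenation before classification.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit length (assumed step, not from the paper)."""
    norm = math.sqrt(sum(v * v for v in vec))
    # Leave all-zero vectors unchanged to avoid division by zero.
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_by_depth_concatenation(
    cnn_features: Sequence[Sequence[float]],
    vit_features: Sequence[Sequence[float]],
) -> List[List[float]]:
    """Join CNN and ViT features along the feature (depth) axis, per image.

    cnn_features[i] and vit_features[i] must describe the same image.
    The result has one row per image, of length len(cnn) + len(vit).
    """
    if len(cnn_features) != len(vit_features):
        raise ValueError("both branches must provide features for the same images")
    return [
        l2_normalize(c) + l2_normalize(v)
        for c, v in zip(cnn_features, vit_features)
    ]


# Hypothetical toy features: 2 images, 2 CNN dimensions, 3 ViT dimensions.
cnn_branch = [[3.0, 4.0], [1.0, 0.0]]
vit_branch = [[0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
fused = fuse_by_depth_concatenation(cnn_branch, vit_branch)
```

Each fused row would then be passed to a classifier such as the Fine Gaussian SVM named in the abstract. Normalizing each branch before concatenation is one common way to keep either branch from dominating by scale alone; whether the authors do so is not stated.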
Abrar et al. (Mon,) studied this question.