What question did this study set out to answer?

This research aims to develop an improved deep learning model for cervical cancer classification using advanced architectures.

May 27, 2026 · Open Access

An interpretable fused dense-based vision transformer architecture for the classification of cervical cancer

Key Points

Introduced a fused architecture combining a dense convolutional neural network and a modified vision transformer.
Utilized image enhancement techniques and data augmentation to improve model performance and address data scarcity.
Evaluated the model on Herlev (917 images) and Intel & Mobile-ODT (8,215 images) datasets.
Achieved 96% accuracy on the Herlev dataset and 92% on the Intel & Mobile-ODT dataset.
Outperformed standalone CNNs (up to 72% accuracy) and ViTs (up to 91% accuracy).
Achieved AUCs of 0.991 (Herlev) and 0.981 (Intel & Mobile-ODT), indicating superior model generalization.
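
For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of how an area under the ROC curve can be computed for a binary problem using the rank (Mann-Whitney) formulation. It is not the authors' evaluation code: the function name `binary_auc` and the toy labels and scores are hypothetical, and the study may well have used a multi-class or library-based computation instead.

```python
def binary_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation.

    labels: sequence of 0/1 ground-truth labels (1 = positive class).
    scores: sequence of classifier scores; higher means "more likely positive".
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")

    # Count positive/negative pairs that are ranked correctly; ties count as half.
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy data, NOT results from the study.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = binary_auc(toy_labels, toy_scores)
```

An AUC of 1.0 means every positive sample is scored above every negative one, and 0.5 corresponds to chance-level ranking, so values such as 0.991 and 0.981 indicate that the two classes are almost perfectly separated by the classifier scores.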

Abstract

Cervical cancer remains a critical global health challenge, particularly in low-resource settings, necessitating robust computer-aided diagnostic (CAD) systems for early detection. This study introduces a novel deep learning architecture that fuses a self-attention-enhanced dense convolutional neural network (CNN) with a modified vision transformer (ViT) to improve classification accuracy for cervical cancer. The dense CNN captures local features such as cell nuclei morphology, whereas the modified vision transformer extracts contextual features, so the fused model captures both fine-grained and global details. The framework uses image enhancement (histogram equalization and top/bottom hat filtering) to improve data quality, and data augmentation to address dataset scarcity and class imbalance. Evaluated on the Herlev (917 cell images) and Intel & Mobile-ODT (8,215 colposcopy images) datasets, the fused model achieves state-of-the-art accuracies of 96% (Herlev) and 92% (Intel & Mobile-ODT), outperforming standalone CNNs (up to 72%), ViTs (up to 91%), and prior methods (up to 91.94%). Fusing the two models' features via depth concatenation and passing them to a classifier (e.g., Fine Gaussian SVM) yields superior generalization, with AUCs of 0.991 and 0.981 on the respective datasets. This work underscores the potential of hybrid architectures in bridging the inductive biases of CNNs and transformers for medical image analysis. Compared with previous research that relied solely on CNNs or basic transformers, this work combines a dense CNN with self-attention and a modified residual-enriched vision transformer, addressing the limitations of single-modality feature extraction and improving generalization across cytology and colposcopy images.
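
The image enhancement step named in the abstract (histogram equalization and top/bottom hat filtering) can be illustrated with a small pure-Python sketch on a grayscale image stored as a list of rows. The abstract does not give the structuring element, the order of operations, or how the hat transforms are combined, so everything here is an assumption: the function names are hypothetical, the structuring element is a flat square of a chosen radius, and the combination `image + top_hat - bottom_hat` is one common contrast-enhancement recipe rather than the authors' confirmed method.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image (list of rows of ints in [0, levels))."""
    pixels = [p for row in image for p in row]
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return [row[:] for row in image]
    scale = (levels - 1) / (total - cdf_min)
    return [[round((cdf[p] - cdf_min) * scale) for p in row] for row in image]


def _morph(image, radius, pick):
    """Apply min (erosion) or max (dilation) over a square window clipped at the borders."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [image[j][i]
                      for j in range(max(0, y - radius), min(h, y + radius + 1))
                      for i in range(max(0, x - radius), min(w, x + radius + 1))]
            row.append(pick(window))
        out.append(row)
    return out


def top_bottom_hat_enhance(image, radius=1, levels=256):
    """Assumed recipe: enhanced = image + top_hat - bottom_hat, clipped to the valid range."""
    opened = _morph(_morph(image, radius, min), radius, max)  # opening: erosion, then dilation
    closed = _morph(_morph(image, radius, max), radius, min)  # closing: dilation, then erosion
    enhanced = []
    for img_row, op_row, cl_row in zip(image, opened, closed):
        row = []
        for p, o, c in zip(img_row, op_row, cl_row):
            top_hat = p - o      # bright details smaller than the window
            bottom_hat = c - p   # dark details smaller than the window
            row.append(max(0, min(levels - 1, p + top_hat - bottom_hat)))
        enhanced.append(row)
    return enhanced


# Hypothetical 3x3 patch with one bright pixel, standing in for a small nucleus-like detail.
patch = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
enhanced_patch = top_bottom_hat_enhance(patch, radius=1)
equalized_patch = equalize_histogram([[0, 0], [1, 1]])
```

In this sketch the top-hat term brightens small bright structures and the bottom-hat term darkens the background around them, which is why the combination is often used to raise local contrast before classification. In practice an image library would replace these nested loops.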

Cite This Study

Abrar et al. (Mon,) studied this question.

synapsesocial.com/papers/6a168ab40c924ddd1bd597dc https://doi.org/10.1186/s40001-026-04457-y
