What question did this study set out to answer?

This study aims to develop and evaluate an interpretable deep learning model for early detection of oral squamous cell carcinoma.

May 16, 2026 | Open Access

Compact Attention-Driven Deep Network for Interpretable Oral Squamous Cell Carcinoma Screening

Key Points

Developed a MobileNetV2 model with a dropout-regularized Convolutional Block Attention Module (CBAM).
Evaluated on a class-imbalanced dataset of histological images with 5-fold cross-validation.
Utilized gradient-weighted Class Activation Mapping for interpretability.
Achieved a peak accuracy of 92.24% and mean ROC-AUC of 0.9950 on a class-imbalanced dataset.
Increased accuracy to 96.38% on an augmented, balanced dataset.
Enabled single-image inference in 15.9 ms and training on new data in under an hour.
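
The evaluation protocol in the points above (5-fold cross-validation on class-imbalanced data, reported as mean ± standard deviation) can be sketched as follows. This is a minimal illustration, not the study's code: the stratified split, the class counts, the per-fold accuracies, and the use of the sample standard deviation are all assumptions made for the example, and the helper names `stratified_folds` and `summarize` are hypothetical.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin so every fold gets its share.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical imbalanced labels: 300 "normal" (0) and 900 "OSCC" (1).
labels = [0] * 300 + [1] * 900
folds = stratified_folds(labels, k=5)

# Hypothetical per-fold accuracies, NOT the study's results.
fold_accuracies = [0.88, 0.91, 0.86, 0.92, 0.90]
mean_acc, std_acc = summarize(fold_accuracies)
print(f"accuracy: {mean_acc:.2%} ± {std_acc:.2%}")
```

Stratifying keeps the minority class represented in every fold, which matters when one class is much smaller than the other; whether the study stratified its folds is not stated in the abstract.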

Abstract

Early detection of oral squamous cell carcinoma (OSCC) is crucial for improving patient outcomes and reducing treatment costs, particularly in resource-limited settings that rely on mobile screening. This study introduces a tailored attention-enhanced MobileNetV2 model that integrates a dropout-regularized Convolutional Block Attention Module (CBAM) at an intermediate bottleneck, resulting in a compact model with only 2.67 million parameters and a 10.21 MB footprint. Evaluated on the class-imbalanced Mendeley dataset of 1,224 histological images using 5-fold cross-validation, the model achieved a mean accuracy of 89.46% ± 2.66%, a peak accuracy of 92.24%, and a mean ROC-AUC of 0.9950 ± 0.0045, indicating strong discriminative capability. Class-wise performance was balanced (normal: precision, 76.3%; recall, 82.5%; F1 = 0.788; OSCC: precision, 94.6%; recall, 91.5%; F1 = 0.930). Training on an augmented, class-balanced Kaggle dataset improved performance, yielding a mean accuracy of 96.38% ± 0.64% and a mean ROC-AUC of 0.9915 ± 0.0025. The network supports single-image inference in 15.9 ms and completes training on new clinical data in less than an hour, allowing for rapid updates. Gradient-weighted Class Activation Mapping (Grad-CAM) provides interpretability by highlighting morphologically relevant regions that are consistent with OSCC pathology. This lightweight, attention-augmented architecture delivered robust performance on both datasets, offering a fast, interpretable, and portable framework for OSCC screening in diverse clinical environments.

• An attention-enhanced MobileNetV2 with dropout-regularized CBAM achieves strong OSCC detection using only 2.67M parameters (10.21 MB), enabling fast 15.9 ms single-image inference suitable for mobile and low-resource settings.
• The model delivers up to 92.24% accuracy and a ROC-AUC of 0.9950 on a class-imbalanced histological dataset, with balanced class-wise results, and further improves to 96.38% accuracy on an augmented, balanced dataset.
• Training completes in under an hour on new data, and Grad-CAM visualizations highlight clinically relevant morphological regions, supporting transparent and reliable OSCC screening.
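
Two of the reported figures can be sanity-checked with simple arithmetic, sketched below. The function names are hypothetical, and 32-bit floating-point weight storage is an assumption, since the abstract does not state the numeric format. The F1 score is the harmonic mean of precision and recall; applying it to the reported mean precision and recall reproduces the OSCC value of 0.930 and gives roughly 0.793 for the normal class, close to the reported 0.788. A small gap of this kind would be expected if F1 was computed per fold and then averaged, which is also an assumption here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


def size_mib(num_params, bytes_per_param=4):
    """Approximate weight storage in MiB, assuming float32 parameters."""
    return num_params * bytes_per_param / 2**20


# Mean precision and recall as reported in the abstract.
f1_normal = f1_score(0.763, 0.825)
f1_oscc = f1_score(0.946, 0.915)

# 2.67 million parameters, as reported; float32 storage is an assumption.
weights_mib = size_mib(2.67e6)

print(f"F1 normal: {f1_normal:.3f}, F1 OSCC: {f1_oscc:.3f}")
print(f"weights: {weights_mib:.2f} MiB")
```

Under the float32 assumption, the parameter count alone accounts for about 10.2 MiB, which is in line with the reported 10.21 MB footprint.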
