Early detection of oral squamous cell carcinoma (OSCC) is crucial for improving patient outcomes and reducing treatment costs, particularly in resource-limited settings that rely on mobile screening. This study introduces an attention-enhanced MobileNetV2 model that integrates a dropout-regularized Convolutional Block Attention Module (CBAM) at an intermediate bottleneck, yielding a compact model with only 2.67 million parameters and a 10.21 MB footprint. Evaluated with 5-fold cross-validation on the class-imbalanced Mendeley dataset of 1,224 histological images, the model achieved a mean accuracy of 89.46% ± 2.66% (peak 92.24%) and a mean ROC-AUC of 0.9950 ± 0.0045, indicating strong discriminative capability. Class-wise performance was balanced (normal: precision 76.3%, recall 82.5%, F1 0.788; OSCC: precision 94.6%, recall 91.5%, F1 0.930). Training on an augmented, class-balanced Kaggle dataset raised the mean accuracy to 96.38% ± 0.64%, with a mean ROC-AUC of 0.9915 ± 0.0025. The network performs single-image inference in 15.9 ms and completes training on new clinical data in less than an hour, allowing for rapid updates. Gradient-weighted Class Activation Mapping (Grad-CAM) provides interpretability by highlighting morphologically relevant regions that are consistent with OSCC pathology. This lightweight, attention-augmented architecture delivered robust performance on both datasets, offering a fast, interpretable, and portable framework for OSCC screening in diverse clinical environments.

• An attention-enhanced MobileNetV2 with dropout-regularized CBAM achieves strong OSCC detection using only 2.67M parameters (10.21 MB), enabling fast 15.9 ms single-image inference suitable for mobile and low-resource settings.
• The model delivers up to 92.24% accuracy and a ROC-AUC of 0.9950 on a class-imbalanced histological dataset, with balanced class-wise results, and improves further to 96.38% accuracy on an augmented, balanced dataset.
• Training completes in under an hour on new data, and Grad-CAM visualizations highlight clinically relevant morphological regions, supporting transparent and reliable OSCC screening.
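The reported figures can be related to one another with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: the helper names `f1_score` and `float32_size_mib` are hypothetical, and it assumes that parameters are stored as 32-bit floats and that "MB" denotes 2^20 bytes. Neither assumption is stated in the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    return 2 * precision * recall / (precision + recall)


def float32_size_mib(num_params):
    """Storage needed for num_params 32-bit floats, in MiB (assumed: 4 bytes each)."""
    return num_params * 4 / 2**20


# Mean class-wise precision and recall as reported in the abstract.
f1_normal = f1_score(0.763, 0.825)
f1_oscc = f1_score(0.946, 0.915)

# Footprint estimate from the rounded parameter count (2.67 million).
size_mib = float32_size_mib(2.67e6)

print(f"F1 (normal): {f1_normal:.3f}")
print(f"F1 (OSCC):   {f1_oscc:.3f}")
print(f"Estimated float32 footprint: {size_mib:.2f} MiB")
```

Under these assumptions the OSCC value comes out at about 0.930, matching the reported F1, while the normal-class value comes out at about 0.793 rather than the reported 0.788. One plausible reason, not confirmed by the abstract, is that the reported F1 is averaged per fold, and the mean of per-fold F1 scores generally differs from the F1 computed from mean precision and recall. The footprint estimate of roughly 10.19 MiB is close to the reported 10.21 MB, with the small gap attributable to rounding of the parameter count or to file overhead.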
Manihira et al. (Fri,) studied this question.