Skin cancer is among the most prevalent and life-threatening dermatological diseases worldwide, and melanoma accounts for a substantial proportion of skin cancer–related deaths, often because diagnosis is delayed or unreliable. Conventional clinical screening based on visual inspection and expert interpretation is inherently subjective and is often affected by inter-observer variability, lesion heterogeneity, and imaging artifacts, which highlights the need for accurate and generalizable automated diagnostic systems. This study proposes a novel hybrid deep learning architecture for skin cancer classification that integrates an attention-guided autoencoder with a transformer-inspired global context modeling module, forming a unified representation learning framework. The encoder–decoder structure is designed to suppress noise and reconstruct salient lesion features, while an embedded attention mechanism emphasizes diagnostically relevant regions such as irregular boundaries and pigmentation patterns. The encoded representations are subsequently refined using transformer-style self-attention to capture long-range spatial dependencies and complex color–texture correlations, enabling better discrimination than conventional CNNs and standalone transformers. In addition, a new hybrid hyperparameter optimization strategy combines Bayesian Optimization with Grey Wolf Optimization (GWO) and the Whale Optimization Algorithm (WOA) in a coordinated meta-heuristic framework. Bayesian Optimization provides probabilistic guidance for efficient global search, while GWO and WOA, which model collective hunting and encircling behaviors, improve the exploration–exploitation balance and help prevent premature convergence. This hybrid optimizer dynamically tunes both architectural and training hyperparameters, including learning rate, batch size, latent dimension, attention depth, and transformer token resolution.
The proposed framework is comprehensively evaluated on three benchmark dermoscopic datasets—HAM10000, ISIC-2019, and ISIC-2020—using standardized preprocessing and data augmentation to mitigate class imbalance and illumination variability. Experimental results demonstrate that the proposed approach consistently outperforms state-of-the-art CNN, transformer, and hybrid models, achieving classification accuracy exceeding 98% with improved F1-score and AUROC.
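To make the grey wolf component of the hybrid optimizer more concrete, the following is a minimal sketch of the standard GWO position update applied to a two-dimensional hyperparameter search. It is an illustration under stated assumptions, not the implementation used in this study: the coupling with Bayesian Optimization and WOA is not shown, and the function `grey_wolf_optimize`, the objective `toy_validation_loss`, the search bounds, and the optimum at a learning rate of 1e-3 with a latent dimension of 128 are all hypothetical stand-ins for a real training-and-validation run.

```python
import numpy as np


def grey_wolf_optimize(objective, lower, upper, n_wolves=20, n_iters=100, seed=0):
    """Minimize `objective` inside box bounds with the standard GWO update."""
    rng = np.random.default_rng(seed)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    dim = lower.size

    # Random initial pack inside the search box.
    wolves = rng.uniform(lower, upper, size=(n_wolves, dim))
    best_pos, best_score = None, np.inf

    for t in range(n_iters + 1):
        scores = np.array([objective(w) for w in wolves])
        order = np.argsort(scores)

        # Track the best point evaluated so far.
        if scores[order[0]] < best_score:
            best_score = float(scores[order[0]])
            best_pos = wolves[order[0]].copy()
        if t == n_iters:
            break  # the final pass only evaluates the last pack

        # Alpha, beta, and delta are the three best wolves of this iteration.
        leaders = wolves[order[:3]]

        # `a` decays linearly from 2 to 0, shifting exploration to exploitation.
        a = 2.0 * (1.0 - t / n_iters)
        r1 = rng.random((3, n_wolves, dim))
        r2 = rng.random((3, n_wolves, dim))
        A = 2.0 * a * r1 - a
        C = 2.0 * r2

        # Encircling step toward each leader, then average the three moves.
        D = np.abs(C * leaders[:, None, :] - wolves[None, :, :])
        candidates = leaders[:, None, :] - A * D
        wolves = np.clip(candidates.mean(axis=0), lower, upper)

    return best_pos, best_score


def toy_validation_loss(x):
    """Hypothetical surrogate: lowest at log10(lr) = -3 and latent dim = 128."""
    log10_lr, latent_dim = x
    return (log10_lr + 3.0) ** 2 + (latent_dim / 128.0 - 1.0) ** 2


# Assumed search space: log10(learning rate) in [-5, -1], latent dim in [16, 512].
lower_bounds = [-5.0, 16.0]
upper_bounds = [-1.0, 512.0]
best_pos, best_score = grey_wolf_optimize(
    toy_validation_loss, lower_bounds, upper_bounds
)
print("best log10(lr):", best_pos[0], "best latent dim:", best_pos[1])
```

In a real setting the objective would train the model and return a validation metric, so each evaluation is expensive; that cost is the usual motivation for pairing a population method such as GWO with a surrogate-based method such as Bayesian Optimization.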
Abugabah et al. (Mon,) studied this question.