Accurate and interpretable cancer classification in histopathological images remains challenging because tissue samples exhibit complex structural variation. In this paper, we propose MDeiT, a lightweight and interpretable sequential hybrid model that integrates Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to improve both classification accuracy and efficiency. Unlike traditional ensemble-based hybrid models, our framework adopts a streamlined design: MobileNetV2 and DeiT Tiny serve as backbone architectures, and an adaptation layer converts CNN-extracted local features into input for Transformer processing. To improve interpretability, we incorporate Gradient-weighted Class Activation Mapping (Grad-CAM) to provide visual explanations of model predictions. Furthermore, we introduce an expert-driven qualitative validation in which pathologists annotate ground-truth regions, allowing us to systematically assess the alignment between model-generated saliency maps and clinically relevant diagnostic regions and establishing a high-quality benchmark for interpretability evaluation. Extensive experiments on skin and lung cancer datasets show that MDeiT consistently outperforms state-of-the-art models across multiple metrics while maintaining computational efficiency. The results indicate that it captures both fine-grained tissue details and broader contextual patterns, making it a robust and scalable solution for real-world histopathological image analysis.
Dagnaw et al. (Sun,) studied this question.
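To make the role of the adaptation layer concrete, the following minimal sketch works through the shape bookkeeping that any CNN-to-Transformer hand-off of this kind has to perform. It is an illustration under stated assumptions, not the implementation used in this work: we assume the final MobileNetV2 feature map is used (1280 channels at an overall stride of 32), that each spatial location becomes one token, and that a single linear projection maps tokens to the DeiT Tiny embedding width of 192. The names `adapter_shapes` and `flatten_to_tokens` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AdapterShapes:
    grid: int               # side length of the CNN feature map
    num_tokens: int         # tokens passed to the Transformer (grid * grid)
    in_channels: int        # channel width of each CNN feature vector
    embed_dim: int          # Transformer embedding width
    projection_params: int  # weights plus biases of a per-token linear projection


def adapter_shapes(image_size=224, cnn_stride=32, cnn_channels=1280, embed_dim=192):
    """Compute token count and projection size for a CNN-to-Transformer adapter.

    Defaults are assumptions: MobileNetV2's last feature map (1280 channels,
    stride 32) and the DeiT Tiny embedding width (192).
    """
    if image_size % cnn_stride != 0:
        raise ValueError("image_size must be divisible by cnn_stride")
    grid = image_size // cnn_stride
    num_tokens = grid * grid
    projection_params = cnn_channels * embed_dim + embed_dim
    return AdapterShapes(grid, num_tokens, cnn_channels, embed_dim, projection_params)


def flatten_to_tokens(feature_map):
    """Reorder a [C][H][W] nested list into [H*W][C] token vectors (row-major)."""
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    return [
        [feature_map[c][y][x] for c in range(channels)]
        for y in range(height)
        for x in range(width)
    ]


if __name__ == "__main__":
    shapes = adapter_shapes()
    print(shapes)
    # Tiny 2-channel, 2x2 feature map to show the reordering.
    toy = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    print(flatten_to_tokens(toy))
```

Under these assumptions, a 224 x 224 input yields a 7 x 7 grid, so the Transformer receives 49 tokens rather than the 196 patches that a standalone DeiT Tiny with 16 x 16 patches would process. This shorter sequence is one plausible source of the efficiency attributed to a sequential design.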