Background/Objectives: Early and accurate detection of breast cancer is essential for reducing mortality and improving patient outcomes. However, manual interpretation of breast ultrasound images is challenging because of image variability, noise, and inter-observer subjectivity. This study aims to address these limitations by developing an automated and interpretable computer-aided diagnosis (CAD) system. Methods: The proposed CAD system integrates ensemble transfer learning with Vision Transformer architectures. It combines the Data-Efficient Image Transformer (DeiT) and the Vision Transformer (ViT) through concatenation-based feature fusion to exploit their complementary representations. Preprocessing, normalization, and targeted data augmentation enhance robustness, while Gradient-weighted Class Activation Mapping (Grad-CAM) provides visual explanations to support clinical interpretability. The proposed model is benchmarked against state-of-the-art CNNs (VGG16, ResNet50, DenseNet201) and Transformer models (ViT, DeiT, Swin, BEiT) on the Breast Ultrasound Images (BUSI) dataset. Results: The ensemble achieved 96.92% accuracy and 97.10% AUC for binary classification, and 94.27% accuracy with 94.81% AUC for three-class classification. External validation on independent datasets demonstrated strong generalizability, with accuracy/AUC of 87.76%/88.07% on BrEaST, 86.77%/85.90% on BUS-BRA, and 86.99%/86.99% on BUSIWHU. Performance decreased for fine-grained BI-RADS classification (76.68%/84.59% accuracy/AUC on BUS-BRA and 68.75%/81.10% on BrEaST), reflecting the inherent complexity and subjectivity of clinical subclassification. Conclusions: The proposed Vision Transformer-based ensemble demonstrates high diagnostic accuracy, strong cross-dataset generalization, and clinically meaningful explainability.
These findings highlight its potential as a reliable second-opinion CAD tool for breast cancer diagnosis, particularly in resource-limited clinical environments.
Al-Tam et al. (Fri,) studied this question.
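
The concatenation-based feature fusion named in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the feature vectors below are random stand-ins for the embeddings that DeiT and ViT backbones would produce, and the vector sizes, the single linear classification head, and all function names are assumptions made for illustration only.

```python
import math
import random


def concat_fusion(deit_features, vit_features):
    """Fuse two per-image feature vectors by joining them end to end."""
    return list(deit_features) + list(vit_features)


def linear_head(fused, weights, biases):
    """Dense layer over the fused vector: one logit per class."""
    return [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Convert logits to class probabilities (numerically stable form)."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy sizes; real backbone embeddings are much larger.
DEIT_DIM, VIT_DIM, NUM_CLASSES = 4, 4, 3

rng = random.Random(0)
# Random stand-ins for the embeddings each backbone would emit for one image.
deit_features = [rng.gauss(0, 1) for _ in range(DEIT_DIM)]
vit_features = [rng.gauss(0, 1) for _ in range(VIT_DIM)]

fused = concat_fusion(deit_features, vit_features)

# Untrained, randomly initialised head (assumed design, for illustration).
weights = [[rng.gauss(0, 0.1) for _ in range(len(fused))]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

probs = softmax(linear_head(fused, weights, biases))
print(len(fused), probs)
```

The fused vector has length `DEIT_DIM + VIT_DIM`, so the classification head sees both representations side by side and can weight them independently. That is the sense in which concatenation lets an ensemble exploit complementary features.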