575 Background: Breast cancer is the most commonly diagnosed malignancy among women worldwide, and screening mammography remains central to early detection and mortality reduction. Despite standardized reporting systems, mammographic interpretation is challenged by tissue overlap, subtle lesion morphology, and interobserver variability. Convolutional neural networks (CNNs) have demonstrated strong performance in automated mammography analysis but rely on localized receptive fields that may limit modeling of the global spatial relationships critical for detecting architectural distortion and diffuse malignancy. Vision Transformers (ViTs) employ self-attention mechanisms that enable global contextual modeling across entire images. We evaluated the performance of a Vision Transformer for breast cancer classification on mammographic imaging.
Methods: We analyzed over 8,000 anonymized mammography images obtained from publicly available Kaggle datasets curated and annotated by expert radiologists. Images included malignant and non-malignant breast findings and were preprocessed and resized to 224×224 pixels. Data were split into training and validation cohorts using stratified sampling at the patient level to prevent data leakage. A Vision Transformer B/16 (ViT-B/16) model pretrained on ImageNet was fine-tuned for binary classification. Images were divided into non-overlapping 16×16 patches, embedded into token sequences with positional encodings and a learnable class token, and processed through 12 transformer blocks with multi-head self-attention. Model performance was evaluated using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC).
Results: The ViT-B/16 model achieved high diagnostic performance, with validation accuracy of approximately 96% and AUROC exceeding 0.98. Sensitivity for malignant lesions remained robust across dense and non-dense breast tissue.
Attention-based global processing reduced misclassification of lesions with diffuse margins, asymmetric density, and architectural distortion, all of which are challenging for purely convolutional approaches. Performance was comparable to state-of-the-art CNN models, though at a higher parameter count and computational cost.
Conclusions: Vision Transformer–based modeling enables accurate breast cancer detection on mammography by capturing global contextual relationships through self-attention. While computationally heavier than optimized CNNs, ViT offers complementary strengths in modeling complex spatial patterns and provides interpretable attention maps aligned with radiologic reasoning. These findings support further evaluation of transformer architectures in large-scale breast cancer screening and diagnostic workflows.
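The patch tokenization described in the Methods can be checked with a little arithmetic. The sketch below is illustrative only and is not the authors' code: `patch_grid` and `split_into_patches` are hypothetical helper names, and the three-channel input is an assumption (the abstract does not say how grayscale mammograms were fed to an ImageNet-pretrained model).

```python
def patch_grid(image_size=224, patch_size=16, channels=3):
    """Token bookkeeping for a ViT with square inputs and square patches.

    channels=3 is an assumption; the abstract does not state the channel count.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    num_patches = per_side * per_side
    return {
        "patches_per_side": per_side,
        "num_patches": num_patches,
        # One learnable class token is prepended to the patch tokens.
        "sequence_length": num_patches + 1,
        # Length of one flattened patch before the linear embedding.
        "values_per_patch": patch_size * patch_size * channels,
    }


def split_into_patches(image, patch_size):
    """Split a single-channel 2D list into flattened, non-overlapping patches.

    Patches are returned in row-major order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size != 0 or width % patch_size != 0:
        raise ValueError("image dimensions must be multiples of patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patches.append([
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ])
    return patches


if __name__ == "__main__":
    print(patch_grid())
    toy = [[row * 4 + col for col in range(4)] for row in range(4)]
    print(split_into_patches(toy, 2))
```

With the stated 224×224 input and 16×16 patches this gives a 14×14 grid, that is 196 patch tokens plus the class token, for a sequence length of 197.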
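Splitting at the patient level means that every image from a given patient lands in exactly one cohort, which is what prevents leakage. A minimal sketch of one way to do this follows; it is not the authors' procedure. The function name `patient_level_split`, the 20% validation fraction, and the rule that a patient counts as malignant if any of their images is malignant are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def patient_level_split(records, val_fraction=0.2, seed=0):
    """Stratified train/validation split that keeps each patient in one cohort.

    records: list of (patient_id, label) pairs, one per image, label in {0, 1}.
    Returns (train_indices, val_indices) into `records`.
    """
    # Assumption: a patient is treated as malignant if any image is malignant.
    patient_label = {}
    for patient_id, label in records:
        patient_label[patient_id] = max(patient_label.get(patient_id, 0), label)

    # Group patients by class so each class is sampled separately (stratification).
    patients_by_class = defaultdict(list)
    for patient_id, label in patient_label.items():
        patients_by_class[label].append(patient_id)

    rng = random.Random(seed)
    val_patients = set()
    for label in sorted(patients_by_class):
        patient_ids = sorted(patients_by_class[label])
        rng.shuffle(patient_ids)
        n_val = round(len(patient_ids) * val_fraction)
        val_patients.update(patient_ids[:n_val])

    train_indices = [i for i, (pid, _) in enumerate(records) if pid not in val_patients]
    val_indices = [i for i, (pid, _) in enumerate(records) if pid in val_patients]
    return train_indices, val_indices


if __name__ == "__main__":
    # Ten hypothetical patients with two images each; five are malignant.
    demo = [(f"p{n:02d}", int(n < 5)) for n in range(10) for _ in range(2)]
    train_idx, val_idx = patient_level_split(demo)
    print(len(train_idx), len(val_idx))
```

Sampling patients rather than images is the key design choice: an image-level split could place two views of the same breast in both cohorts and inflate validation metrics.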
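The reported metrics all follow from the confusion matrix and the ranking of predicted scores. The sketch below shows the standard definitions in plain Python; the helper names `binary_metrics` and `auroc`, the 0.5 decision threshold, and the toy data are assumptions for illustration and do not come from the study.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, specificity, and F1 from binary labels (1 = malignant)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return NaN instead of raising when a class is absent.
        return numerator / denominator if denominator else float("nan")

    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),   # true positive rate
        "specificity": ratio(tn, tn + fp),   # true negative rate
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5).

    Quadratic in the number of samples; fine for a sketch, slow for large sets.
    """
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    labels = [1, 1, 1, 1, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
    predictions = [int(s >= 0.5) for s in scores]  # assumed threshold of 0.5
    print(binary_metrics(labels, predictions))
    print(auroc(labels, scores))
```

Accuracy, sensitivity, specificity, and F1 depend on the chosen threshold, whereas AUROC summarizes ranking quality across all thresholds, which is why the two kinds of metric are usually reported together.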
Shuja et al. studied this question.