What question did this study set out to answer?

The study aimed to evaluate the performance of a Vision Transformer model for classifying breast cancer on mammograms.

May 29, 2026

Beyond convolution: Attention-driven vision transformer modeling for breast cancer detection on mammography.

Key Points

Analyzed over 8,000 anonymized mammography images from publicly available Kaggle datasets curated and annotated by expert radiologists.
Fine-tuned an ImageNet-pretrained Vision Transformer B/16 (ViT-B/16) model for binary classification.
Evaluated model performance using accuracy, sensitivity, specificity, F1 score, and AUROC.
ViT-B/16 model achieved validation accuracy of approximately 96% and AUROC greater than 0.98.
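Sensitivity for malignant lesions remained robust across dense and non-dense breast tissue.
Attention-based global processing reduced misclassification of lesions with diffuse margins, asymmetric density, and architectural distortion; overall performance was comparable to state-of-the-art CNN models, at a higher parameter count and computational cost.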
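
The evaluation metrics named in the key points can be illustrated with a minimal sketch. It assumes binary labels (1 = malignant, 0 = non-malignant) and a per-image malignancy score. The function names, the 0.5 decision threshold, and the toy data are all hypothetical and are not taken from the study.

```python
def binary_metrics(y_true, y_score, threshold=0.5):
    """Confusion-matrix metrics at an assumed decision threshold."""
    y_pred = [1 if s >= threshold else 0 for s in y_score]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / total if total else 0.0,
        # Sensitivity: fraction of malignant cases that were flagged.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Specificity: fraction of non-malignant cases that were cleared.
        "specificity": tn / (tn + fp) if (tn + fp) else 0.0,
        # F1: harmonic mean of precision and sensitivity.
        "f1": 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0,
    }


def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outranks a random
    negative (ties count as half). O(n_pos * n_neg), fine for small sets."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not study data).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
print(binary_metrics(labels, scores))
print(auroc(labels, scores))
```

Accuracy, sensitivity, specificity, and F1 all depend on the chosen threshold, whereas AUROC summarizes ranking quality across every threshold, which is one reason the two kinds of metric are usually reported together.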

Abstract

575 Background: Breast cancer is the most commonly diagnosed malignancy among women worldwide, and screening mammography remains central to early detection and mortality reduction. Despite standardized reporting systems, mammographic interpretation is challenged by tissue overlap, subtle lesion morphology, and interobserver variability. Convolutional neural networks (CNNs) have demonstrated strong performance in automated mammography analysis but rely on localized receptive fields that may limit modeling of the global spatial relationships critical for detecting architectural distortion and diffuse malignancy. Vision Transformers (ViTs) employ self-attention mechanisms that enable global contextual modeling across entire images. We evaluated the performance of a Vision Transformer for breast cancer classification on mammographic imaging.

Methods: We analyzed over 8,000 anonymized mammography images obtained from publicly available Kaggle datasets curated and annotated by expert radiologists. Images included malignant and non-malignant breast findings and were preprocessed and resized to 224×224 pixels. Data were split into training and validation cohorts using stratified sampling at the patient level to prevent data leakage. A Vision Transformer B/16 model pretrained on ImageNet was fine-tuned for binary classification. Images were divided into non-overlapping 16×16 patches, embedded into token sequences with positional encodings and a learnable class token, and processed through 12 transformer blocks with multi-head self-attention. Model performance was evaluated using accuracy, sensitivity, specificity, F1 score, and area under the receiver operating characteristic curve (AUROC).

Results: The ViT-B/16 model achieved high diagnostic performance, with validation accuracy of approximately 96% and AUROC exceeding 0.98. Sensitivity for malignant lesions remained robust across dense and non-dense breast tissue. Attention-based global processing reduced misclassification of lesions with diffuse margins, asymmetric density, and architectural distortion, patterns that are challenging for purely convolutional approaches. Performance was comparable to state-of-the-art CNN models, though it was achieved with a higher parameter count and computational cost.

Conclusions: Vision Transformer-based modeling enables accurate breast cancer detection on mammography by capturing global contextual relationships through self-attention. While computationally heavier than optimized CNNs, ViTs offer complementary strengths in modeling complex spatial patterns and provide interpretable attention maps aligned with radiologic reasoning. These findings support further evaluation of transformer architectures in large-scale breast cancer screening and diagnostic workflows.
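
The patch tokenization described in the Methods can be made concrete with a short arithmetic sketch. It shows how a 224×224 input and 16×16 patches determine the token sequence length. The function names are hypothetical, and the three-channel input is an assumption: ImageNet-pretrained weights expect RGB input, and the abstract does not say how grayscale mammograms were adapted.

```python
def vit_token_layout(image_size=224, patch_size=16, channels=3):
    """Sequence geometry for a ViT with non-overlapping square patches.

    channels=3 is an assumption; the abstract does not state the
    channel handling used for mammograms.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    grid = image_size // patch_size          # patches per side
    num_patches = grid * grid                # patch tokens per image
    return {
        "grid": grid,
        "num_patches": num_patches,
        "seq_len": num_patches + 1,          # +1 for the learnable class token
        "patch_dim": patch_size * patch_size * channels,  # flattened patch size
    }


def patch_boxes(image_size=224, patch_size=16):
    """Yield (top, left, bottom, right) pixel boxes in row-major order."""
    for top in range(0, image_size, patch_size):
        for left in range(0, image_size, patch_size):
            yield (top, left, top + patch_size, left + patch_size)


layout = vit_token_layout()
print(layout)
print(next(patch_boxes()))
```

Under these assumptions each image becomes a 14×14 grid of 196 patch tokens plus one class token, so every self-attention layer relates all 197 positions to one another. That all-pairs attention is the source of both the global context and the extra computational cost noted in the Results.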
