Breast cancer is one of the greatest global health burdens today, and its wide histological variety demands accurate diagnosis. CNN-based systems have been the dominant technology in digital pathology, but their inability to build a global representation has allowed other architectures, such as Vision Transformers, to compete. This paper evaluates the performance of three transformer-based backbone architectures (DeiT Base, Swin Base, and ViT Base) for classifying breast histopathological images into eight granular classes using the BreaKHis database. To facilitate this comparison, we use transfer learning and distinct data augmentation methods. Each architecture was fine-tuned to classify four benign and four malignant subtypes, and each reached an accuracy of at least 94%. Swin Base outperformed the other two, obtaining the highest accuracy of 0.9511 and an F1 score of 0.9434. The unique design and shifted-window mechanism of Swin Base allow it to capture both detailed nuclear information and the larger tissue context to a greater extent than the other two architectures. Additionally, we provide an in-depth analysis of the confusion matrices, which shows that classification accuracy remains high even for classes with minor morphological overlap, further supporting the claim that Swin Base and the other transformer architectures can successfully differentiate between histologically similar classes.
Koyuncu et al. (Wed,) studied this question.
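The accuracy and F1 figures quoted above are both derived from a confusion matrix over the eight classes. As an illustration only, the following minimal sketch shows one way such metrics can be computed. It assumes macro averaging for F1 (the text does not state which averaging was used), and the function name `accuracy_and_macro_f1` and the toy three-class matrix are hypothetical, not taken from the study.

```python
from typing import List, Tuple


def accuracy_and_macro_f1(cm: List[List[int]]) -> Tuple[float, float]:
    """Compute overall accuracy and macro-averaged F1 from a confusion matrix.

    cm[i][j] is the number of samples whose true class is i and whose
    predicted class is j. Macro averaging is an assumption made here.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))

    f1_scores = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, true class differs
        fn = sum(cm[k]) - tp                       # true k, predicted class differs
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        f1_scores.append(2 * tp / denom if denom else 0.0)

    return correct / total, sum(f1_scores) / n


# Toy three-class matrix (hypothetical values, not results from the paper)
toy_cm = [
    [5, 1, 0],
    [0, 4, 0],
    [1, 0, 3],
]
acc, macro_f1 = accuracy_and_macro_f1(toy_cm)
print(f"accuracy={acc:.4f}, macro F1={macro_f1:.4f}")
```

For the eight-class BreaKHis setting, the same function would take an 8x8 matrix. Off-diagonal entries then indicate which subtypes are confused with one another.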