• Systematic review of transfer learning for mammography (2020–2025)
• PRISMA-based synthesis of 154 studies on pretrained CNNs
• Evaluates reproducibility, code availability, and bias risks
• Identifies gaps in external validation and patient-level splitting
• Future directions: multimodal fusion, federated learning, explainable AI

Deep transfer learning has been widely applied to mammography-based breast cancer classification, with many studies reporting high diagnostic performance. However, substantial variability in datasets, validation strategies, and reporting practices complicates interpretation and clinical relevance.

A systematic review was conducted following PRISMA guidelines to identify studies published between 2020 and 2025 that applied pretrained convolutional neural networks to mammographic breast cancer classification. Study characteristics, datasets, architectures, validation strategies, performance metrics, reproducibility indicators, and risk-of-bias factors were extracted and synthesized using a structured narrative approach.

A total of 154 studies were included. While many report high benchmark performance, these findings often arise under limited validation conditions and must be interpreted cautiously. External validation, patient-level data splitting, and transparent reporting of code and training configurations were uncommon. Comparative synthesis revealed that reported performance was strongly influenced by dataset characteristics and validation design, with more methodologically rigorous studies generally reporting moderate but potentially more reliable results.

Deep transfer learning approaches show promise for mammographic breast cancer classification, but the current literature is characterized by substantial methodological heterogeneity, limited reproducibility, and risks of bias.
These findings highlight a persistent gap between benchmark performance and robust clinical applicability, underscoring the need for rigorous validation, transparent reporting, and evaluation on diverse contemporary datasets.
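
As an illustration of the patient-level data splitting that the review found to be uncommon, the following is a minimal sketch and is not taken from any of the reviewed studies. It assumes each image is described by a hypothetical `(patient_id, image_id)` pair; the function name `patient_level_split` and its parameters are likewise illustrative. The idea is to assign whole patients, rather than individual images, to the training or test set so that views of the same patient cannot appear on both sides.

```python
import random
from collections import defaultdict


def patient_level_split(records, test_fraction=0.2, seed=0):
    """Split (patient_id, image_id) pairs so no patient is in both sets.

    All names here are hypothetical and chosen for illustration only.
    """
    # Group image identifiers by the patient they belong to.
    by_patient = defaultdict(list)
    for patient_id, image_id in records:
        by_patient[patient_id].append(image_id)

    # Shuffle patients (not images) with a fixed seed for repeatability.
    patients = sorted(by_patient)
    random.Random(seed).shuffle(patients)

    # Hold out a fraction of patients, keeping at least one for testing.
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])

    train = [(p, img) for p, img in records if p not in test_patients]
    test = [(p, img) for p, img in records if p in test_patients]
    return train, test


# Toy data: 5 hypothetical patients with 4 mammographic views each.
records = [(f"P{i}", f"P{i}_view{j}") for i in range(5) for j in range(4)]
train, test = patient_level_split(records)
```

An image-level random split of the same toy data would almost certainly place views of one patient in both sets, which is one plausible route to the inflated benchmark performance described above.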
Oyekanmi et al. (Fri,) studied this question.