Abstract

Breast cancer pathology reports contain critical clinical information, yet manual extraction of structured data remains resource-intensive and error-prone. Large language models (LLMs) offer promising automated approaches, but no systematic synthesis examines their performance specifically for breast cancer pathology report processing. Following PRISMA guidelines, we searched seven databases from inception to December 2025, with two reviewers independently screening studies and extracting data. Methodological quality was assessed using PROBAST + AI and reporting completeness using TRIPOD + AI. Nine studies met inclusion criteria, evaluating over 30 distinct LLM architectures across datasets totaling approximately 14,161 reports. Best-performing models achieved study-specific accuracy ranging from 87.7% to 97.4%, though figures are not directly comparable across studies due to differences in task formulation, target data elements, and evaluation metrics. PROBAST + AI assessment found 55.6% of studies at low concern/risk across all domains, with the Outcome domain showing greatest variability. TRIPOD + AI revealed gaps in fairness reporting, open science practices, and patient/public involvement. LLMs demonstrate promising performance approaching human-level accuracy, but methodological quality varies, with key concerns regarding reference standard development, limited external validation, and inadequate fairness reporting.
Shankar et al. studied this question.