What question did this study set out to answer?

This review aims to assess the performance of large language models in extracting clinical data from breast cancer pathology reports.

May 24, 2026 | Open Access

Performance of large language models for extracting clinical data from breast cancer pathology reports: a systematic review

Key Points

This review aims to assess the performance of large language models in extracting clinical data from breast cancer pathology reports.
Conducted a systematic search across seven databases following PRISMA guidelines.
Included nine studies evaluating over 30 LLM architectures across approximately 14,161 reports.
Assessed methodological quality using PROBAST + AI and reporting completeness with TRIPOD + AI.
Best-performing models achieved study-specific accuracy between 87.7% and 97.4%, though the figures are not directly comparable across studies.
55.6% of studies were rated low concern/risk across all PROBAST + AI domains, with the Outcome domain showing the greatest variability.
Identified gaps in fairness reporting and methodological standards.

Abstract

Breast cancer pathology reports contain critical clinical information, yet manual extraction of structured data remains resource-intensive and error-prone. Large language models (LLMs) offer promising automated approaches, but no systematic synthesis examines their performance specifically for breast cancer pathology report processing. Following PRISMA guidelines, we searched seven databases from inception to December 2025, with two reviewers independently screening studies and extracting data. Methodological quality was assessed using PROBAST + AI and reporting completeness using TRIPOD + AI. Nine studies met inclusion criteria, evaluating over 30 distinct LLM architectures across datasets totaling approximately 14,161 reports. Best-performing models achieved study-specific accuracy ranging from 87.7% to 97.4%, though figures are not directly comparable across studies due to differences in task formulation, target data elements, and evaluation metrics. PROBAST + AI assessment found 55.6% of studies at low concern/risk across all domains, with the Outcome domain showing the greatest variability. TRIPOD + AI revealed gaps in fairness reporting, open science practices, and patient/public involvement. LLMs demonstrate promising performance approaching human-level accuracy, but methodological quality varies, with key concerns regarding reference standard development, limited external validation, and inadequate fairness reporting.
