What question did this study set out to answer?

This research aims to evaluate the performance and reliability of large language models (LLMs) for extracting axillary surgery types from clinical texts in breast cancer cases.

May 30, 2026

Accuracy and consensus strategies for large language model abstraction of axillary surgery type from a breast cancer oncologic history text.

Key Points

Analyzed 100 breast cancer clinical texts of varied complexity, each independently labeled by two physicians to establish a benchmark.
Evaluated four LLMs (Gemini-3, GPT-5.2, Claude-4.5, Grok-4) using a single-pass prompt to classify axillary surgery types.
Utilized ensemble strategies with varying agreement thresholds to enhance abstraction accuracy.
Physician accuracy against the finalized label was 92% and 96%; the two physicians agreed on 88% of cases (κ = 0.748).
Gemini-3 and Grok-4 achieved 88% overall accuracy, while Claude-4.5 maintained 91% conditional accuracy despite 12% abstention.
Unanimous model agreement auto-labeled 51% of cases with 96% conditional accuracy, showing effective selective automation.
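
The three per-model metrics named above (coverage, overall accuracy, conditional accuracy) can be illustrated with a minimal Python sketch. The function name `abstraction_metrics` and the use of `None` to mark an abstention are hypothetical, and the sketch assumes an abstention counts as a miss for overall accuracy, which the abstract does not state explicitly.

```python
from typing import Optional, Sequence


def abstraction_metrics(preds: Sequence[Optional[str]], truth: Sequence[str]) -> dict:
    """Coverage, overall accuracy, and conditional accuracy for one model.

    `None` in `preds` is a hypothetical marker for an abstention.
    """
    if len(preds) != len(truth) or not truth:
        raise ValueError("preds and truth must be non-empty and equal in length")
    n = len(truth)
    # Keep only the cases where the model committed to a label.
    answered = [(p, t) for p, t in zip(preds, truth) if p is not None]
    correct = sum(p == t for p, t in answered)
    return {
        # Share of cases with a non-abstaining prediction.
        "coverage": len(answered) / n,
        # Assumption: an abstention is scored as incorrect here.
        "overall_accuracy": correct / n,
        # Accuracy among non-abstaining predictions only.
        "conditional_accuracy": correct / len(answered) if answered else float("nan"),
    }
```

Under this assumption, a model that abstains often can show high conditional accuracy while its overall accuracy stays lower, which is why the two figures are reported separately.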

Abstract

e13702

Background: Accurate abstraction of axillary surgery from unstructured oncologic history is essential for breast cancer research, quality assessment, and downstream data modeling. Large language models (LLMs) offer a scalable approach but vary in performance and reliability. We evaluated multiple LLMs and ensemble strategies for high-confidence axillary surgery abstraction from real-world clinical texts.

Methods: Unedited oncologic history narratives were manually copied verbatim from clinical oncologic history section for 100 breast cancer cases (EPIC EHR) with varied note complexities and adjudicated axillary surgery outcomes (sentinel lymph node biopsy SLNB, axillary lymph node dissection ALND, or no axillary surgery NONE). Two physician abstractors (TD, DH) independently labeled all cases to establish a finalized ground-truth benchmark. Four contemporary LLMs from different vendors (Gemini-3, GPT-5.2, Claude-4.5, Grok-4) were evaluated using an identical single-pass prompt instructing models to classify as SLNB, ALND, or NONE with abstention permitted. Model outputs were compared against the benchmark. Performance metrics included coverage, overall accuracy, and conditional accuracy (among non-abstaining predictions), with class imbalance considered. Ensemble strategies used increasing agreement thresholds (≥2/4, ≥3/4, unanimous 4/4). All data were de-identified and analyzed under IRB. This was a proof-of-concept evaluation of abstraction reliability and ensemble confidence strategies.

Results: Ground-truth distribution was SLNB 68%, NONE 19%, ALND 13%. Physician accuracy vs. finalized label was TD 92%, DH 96%; when agreed (88% cases, κ = 0.748), conditional accuracy was 100%. Among LLMs, Gemini-3 and Grok-4 achieved highest performance (88% overall accuracy each; 100% coverage). Claude-4.5 abstained in 12% but had 91% conditional accuracy. GPT-5.2 showed 60% overall accuracy. Ensembles revealed precision–coverage tradeoff: unanimous 4-of-4 agreement auto-labeled 51% cases with 96% conditional accuracy; 3-of-4 auto-labeled 88% with 92% conditional accuracy. ≥2-model agreement did not exceed best individual models. Unanimous failures were rare (3%), with manual review identifying temporal anchoring to prior procedures and incomplete Oncologic History summaries, reflecting real world documentation limitations.

Conclusions: LLMs can accurately abstract axillary surgery from unstructured text, approaching but not matching expert performance, with model-specific error modes. Ensemble agreement supports selective automation, auto-labeling high-confidence cases while routing ambiguities for human review. Multi-model unanimity yields low-error subsets suitable for clean data generation. These findings support human-in-the-loop frameworks for scalable oncology data abstraction.
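
The agreement-threshold ensembles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's implementation: the function name `ensemble_label` is hypothetical, abstentions (represented as `None`) are assumed not to count as votes, and a tie for the top label is assumed to route the case to human review.

```python
from collections import Counter
from typing import Optional, Sequence


def ensemble_label(votes: Sequence[Optional[str]], min_agree: int) -> Optional[str]:
    """Return the auto-assigned label, or None to route the case to human review.

    votes: one entry per model ("SLNB", "ALND", "NONE"), or None for an abstention.
    min_agree: how many models must agree (2, 3, or 4 for a four-model ensemble).
    """
    # Assumption: abstentions are dropped and never count toward agreement.
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    top = counts.most_common(2)
    label, n = top[0]
    # Assumption: a tie for first place (e.g. 2 vs 2) is treated as ambiguous.
    tied = len(top) > 1 and top[1][1] == n
    if n >= min_agree and not tied:
        return label
    return None


# Usage: the same votes pass a 3-of-4 threshold but fail the unanimous one.
votes = ["SLNB", "SLNB", "SLNB", "ALND"]
print(ensemble_label(votes, min_agree=3))  # SLNB
print(ensemble_label(votes, min_agree=4))  # None
```

Raising `min_agree` shrinks the share of auto-labeled cases while the labels that remain have more model support, which is the precision-coverage tradeoff described in the Results.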
