What question did this study set out to answer?

Assess the accuracy of large language models in extracting structured features from clinical and anatomic pathology notes.

February 14, 2026

Evaluation of Large-Language Models for Structured Feature Extraction of Anatomic and Clinical Pathology Reports

Key Points

Assessed the accuracy of large language models in extracting structured features from clinical and anatomic pathology notes.
Evaluated GPT-4o and GPT-5 for feature extraction accuracy across three pathology datasets.
Compared LLM-derived features with manual labels from expert clinicians.
Developed a web application for rapid structured feature extraction prototyping.
LLMs achieved high accuracy with error rates near 5% for simple cases and 10% for complex cases.
Expert-LLM concordance was extremely high (κ>0.9), slightly below inter-expert concordance.
Model errors were primarily due to confusion between negative and indeterminate findings.
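
As an illustration of the concordance statistic cited above, the following is a minimal sketch of Cohen's κ for two raters, here an expert and a model. The labels are invented for illustration and are not data from the study; the function name `cohen_kappa` is likewise an assumption, not something the authors describe.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labeled the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of cases")
    n = len(labels_a)
    # Observed agreement: fraction of cases where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels: the model calls one indeterminate case "negative".
expert = ["negative", "negative", "positive", "indeterminate", "negative", "positive"]
model = ["negative", "negative", "positive", "negative", "negative", "positive"]
kappa = cohen_kappa(expert, model)
print(round(kappa, 3))
```

In this toy example the raters agree on 5 of 6 cases, yet κ is 0.7 because κ discounts the agreement expected by chance. The single disagreement mirrors the negative versus indeterminate confusion described in the last key point.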

Abstract

Background: Feature extraction via manual chart review is often used for both patient care and research, but it is time-intensive and costly. Recent improvements in natural language processing present novel opportunities to perform high-throughput automated feature extraction. Here, we assessed the accuracy of large language models (LLMs) for structured feature extraction from clinical and anatomic pathology notes.

Methods: We assessed the accuracy of feature extraction by the OpenAI GPT-4o and GPT-5 models across 3 pathology data sets: cardiac transplant pathology reports, hemoglobin variant test interpretations, and urine drug test interpretations. For each case, model-derived features were compared to manual labels from expert clinicians. We also developed a novel web application to enable rapid development and prototyping of structured function calls to common LLMs.

Results: We first developed a “toolbuilder” application to design structured feature extractions from clinical text. Using this application, current LLMs had high accuracy, with error rates near 5% for simple cases and 10% for more complex use cases. Performance was strongly influenced by model type but was not drastically improved by prompt engineering or other input adaptations. Across a range of features, expert–LLM concordance was extremely high (κ > 0.9) and only slightly below inter-expert concordance. Model errors were most commonly due to confusion between negative and indeterminate findings, suggesting overconfidence of the models in the presence of reduced information.

Conclusion: These findings suggest that LLM tools can provide significant value in automating time- and cost-intensive clinical note feature extraction and annotation.
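
To make the idea of a structured function call concrete, here is a minimal sketch of an extraction schema and a check on the arguments a model returns. Everything in it is an assumption made for illustration: the tool name `record_drug_test_finding`, the fields `substance` and `finding`, and the helper `parse_tool_arguments` are hypothetical and are not taken from the study or its toolbuilder application. The dictionary only follows the general JSON Schema shape commonly used for function-calling tool definitions, and no LLM API is called.

```python
import json

# Hypothetical tool definition for a urine drug test interpretation.
EXTRACTION_TOOL = {
    "name": "record_drug_test_finding",
    "description": "Record the interpreted finding for one substance.",
    "parameters": {
        "type": "object",
        "properties": {
            "substance": {"type": "string"},
            "finding": {
                "type": "string",
                "enum": ["positive", "negative", "indeterminate"],
            },
        },
        "required": ["substance", "finding"],
    },
}


def parse_tool_arguments(raw_json, tool=EXTRACTION_TOOL):
    """Parse the JSON arguments a model returned and check them against the schema."""
    args = json.loads(raw_json)
    if not isinstance(args, dict):
        raise ValueError("tool arguments must be a JSON object")
    schema = tool["parameters"]
    missing = [key for key in schema["required"] if key not in args]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    for key, value in args.items():
        spec = schema["properties"].get(key)
        if spec is None:
            raise ValueError(f"unexpected field: {key}")
        if "enum" in spec and value not in spec["enum"]:
            raise ValueError(f"invalid value for {key}: {value!r}")
    return args


# A well-formed response passes; an out-of-vocabulary finding is rejected.
ok = parse_tool_arguments('{"substance": "fentanyl", "finding": "indeterminate"}')
print(ok["finding"])
```

Restricting `finding` to a closed vocabulary that includes an explicit "indeterminate" option lets the model report uncertainty in a form that can be compared directly with expert labels.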
