Abstract

Background
Feature extraction via manual chart review is often used for both patient care and research, but it is time-intensive and costly. Recent improvements in natural language processing present novel opportunities to perform high-throughput automated feature extraction. Here, we assessed the accuracy of large language models (LLMs) for structured feature extraction from clinical and anatomic pathology notes.

Methods
We assessed the accuracy of feature extraction by the OpenAI GPT-4o and GPT-5 models across 3 pathology data sets: cardiac transplant pathology reports, hemoglobin variant test interpretations, and urine drug test interpretations. For each case, model-derived features were compared to manual labels from expert clinicians. We also developed a novel web application to enable rapid development and prototyping of structured function calls to common LLMs.

Results
We first developed a “toolbuilder” application to design structured feature extractions from clinical text. Using this application, current LLMs had high accuracy, with error rates near 5% for simple cases and 10% for more complex use cases. Performance was strongly influenced by model type but was not drastically improved by prompt engineering or other input adaptations. Across a range of features, expert–LLM concordance was extremely high (κ0.9), and only slightly below inter-expert concordance. Model errors were most commonly due to mistakes between negative and indeterminate findings, suggesting overconfidence of the models in the presence of reduced information.

Conclusion
These findings suggest that LLM tools can provide significant value in automating time- and cost-intensive clinical note feature extraction and annotation.
Foy et al. studied this question.
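The abstract reports expert-LLM concordance as a κ statistic and notes that most model errors were confusions between negative and indeterminate findings. As a rough illustration of how such a comparison could be computed, the sketch below implements Cohen's kappa for two label lists and tallies the disagreeing label pairs. It is not the authors' code: the function names `cohens_kappa` and `error_pairs`, the three label values, and the example data are all hypothetical, and the abstract does not state which kappa variant was used.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal frequency, summed over labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


def error_pairs(expert, model):
    """Count (expert_label, model_label) pairs where the two disagree."""
    return Counter((e, m) for e, m in zip(expert, model) if e != m)


# Hypothetical labels for eight cases (illustrative only, not study data).
expert = ["positive", "negative", "negative", "indeterminate",
          "positive", "negative", "indeterminate", "negative"]
model = ["positive", "negative", "negative", "negative",
         "positive", "negative", "indeterminate", "negative"]

kappa = cohens_kappa(expert, model)
errors = error_pairs(expert, model)
print(f"kappa = {kappa:.3f}")
print(errors.most_common())
```

In this toy data the single disagreement is an expert "indeterminate" that the model labeled "negative", the kind of error the abstract describes as most common. The same two functions could be applied to a pair of expert annotators to obtain an inter-expert baseline for comparison.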