What question did this study set out to answer?

To assess the effectiveness of large language models in extracting structured data from pediatric emergency records.

May 6, 2026 · Open Access

Extracting structured clinical data from pediatric emergency records using LLMs: A multimodel retrospective study of children with medical complexity

Key Points

Aimed to assess how effectively large language models extract structured data from pediatric emergency records.
Conducted a diagnostic accuracy study using retrospective data from 2007 to 2023.
Analyzed 697 anonymized emergency department records from children with medical complexity.
Evaluated model performance against manual clinician classification as the gold standard.
The GPT-5.2 model achieved high accuracy for triage color (0.99) and ED outcomes (0.984).
Accuracy was 0.96 for laboratory tests and 0.95 for oxygen therapy.
Processing time fell from about 5 minutes to 6 seconds per record.
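
To put the per-record timing in perspective, the following is a minimal sketch of the arithmetic over the whole sample. It assumes the manual figure is exactly 5 minutes (the source says "about") and that both timings apply uniformly to all 697 records; the function and constant names are illustrative only.

```python
# Rough time-savings arithmetic from the reported figures.
# Assumption: "about 5 minutes" is treated as exactly 300 seconds.
RECORDS = 697
MANUAL_SECONDS_PER_RECORD = 5 * 60
LLM_SECONDS_PER_RECORD = 6


def total_hours(n_records: int, seconds_per_record: float) -> float:
    """Total processing time in hours for n_records."""
    return n_records * seconds_per_record / 3600


manual_hours = total_hours(RECORDS, MANUAL_SECONDS_PER_RECORD)
llm_hours = total_hours(RECORDS, LLM_SECONDS_PER_RECORD)
speedup = MANUAL_SECONDS_PER_RECORD / LLM_SECONDS_PER_RECORD

print(f"manual: {manual_hours:.1f} h, LLM: {llm_hours:.2f} h, speedup: {speedup:.0f}x")
# prints "manual: 58.1 h, LLM: 1.16 h, speedup: 50x"
```

Under these assumptions, roughly 58 hours of manual extraction would shrink to a little over one hour of automated processing.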

Abstract

Importance: Emergency departments (EDs) face significant documentation burdens due to reliance on unstructured clinical narratives, which hinders efficiency, particularly in pediatric care. Large language models (LLMs) offer a potential solution by automating data extraction to improve clinical workflows.

Objective: To determine whether an LLM can accurately and efficiently extract structured clinical data from free-text pediatric ED records in a non-English setting.

Design: Diagnostic accuracy study using retrospective data from 2007 to 2023. Manual clinician classification served as the gold standard for assessing model performance.

Setting: Single-center study conducted at the pediatric ED of Padova University Hospital, a tertiary care referral center in Italy.

Participants: A convenience sample of 697 anonymized ED records from children with complex medical conditions.

Exposure: Automated data extraction using OpenAI's GPT-5.2 model via structured prompts processed in Python. All texts were in Italian and were translated to English within the workflow.

Main Outcomes and Measures: Primary outcomes included the accuracy, area under the curve (AUC), sensitivity, and specificity of the LLM in extracting triage color codes, ED outcomes, reasons for ED visit, and performed procedures. Efficiency gains were also measured by comparing manual and automated extraction times.

Results: Among 697 records analyzed, the primary model (GPT-5.2) achieved high accuracy in classifying triage color (0.99) and ED outcome (0.984). Accuracy was 0.96 for laboratory tests, 0.95 for oxygen therapy, and 0.987 for nasogastric tube placement. Results were consistent across all seven models (mean Fleiss' kappa = 0.922). Processing time was reduced from about 5 min to 6 s per record, with a total cost of €23.42.

Conclusions: In this study of pediatric ED encounters in a non-English setting, LLMs reliably extracted structured clinical data and substantially reduced documentation processing time. These findings support their potential to streamline workflows, particularly in resource-constrained environments. Further research is warranted to improve classification of complex or ambiguous information.
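
The outcome measures named in the abstract can be illustrated with a short sketch. It is not the authors' code: the helper names `binary_metrics` and `fleiss_kappa` are hypothetical, and the labels below are toy data invented for illustration, not values from the study. The first function compares model output with a clinician gold standard for one binary field; the second computes Fleiss' kappa for agreement among several models that each label every record.

```python
from collections import Counter


def binary_metrics(gold, pred):
    """Accuracy, sensitivity, specificity for one binary field (1 = present)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(gold),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i] lists the labels all raters gave record i.

    Assumes every record has the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    totals = Counter()
    agreement_sum = 0.0
    for labels in ratings:
        counts = Counter(labels)
        totals.update(counts)
        # Per-record agreement: share of rater pairs that agree.
        squares = sum(c * c for c in counts.values())
        agreement_sum += (squares - n_raters) / (n_raters * (n_raters - 1))
    p_bar = agreement_sum / n_items
    p_e = sum((c / (n_items * n_raters)) ** 2 for c in totals.values())
    if p_e == 1.0:  # every rater used one single label everywhere
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Toy data (invented): gold standard vs. model for "oxygen therapy given".
gold = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 0, 0, 1, 1]
metrics = binary_metrics(gold, pred)

# Toy data (invented): triage colors assigned by three models to four records.
ratings = [
    ["red", "red", "red"],
    ["green", "green", "green"],
    ["red", "red", "green"],
    ["green", "green", "green"],
]
kappa = fleiss_kappa(ratings)
print(metrics, round(kappa, 3))
```

On the toy data, all three binary metrics equal 0.75 and kappa is 23/35, about 0.657. AUC is omitted here because it needs a score or probability per record rather than a hard label.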

Cite This Study

Brigiari et al. studied this question.

synapsesocial.com/papers/69fa8eca04f884e66b5311ac https://doi.org/10.1177/20552076261431431
