March 3, 2026 | Open Access

Large Language Models in Emergency Medicine: A Critical Appraisal of Validity, Reproducibility, and Clinical Utility (2020-2025)

Key Points

High-level accuracy claims in large language models often do not translate into evidence useful for clinical decisions.
Only about one-third of studies disclosed their prompts in full, and real-world clinical data were used less often than synthetic or examination-style vignettes.
Expert benchmarking and external validation were inconsistent, posing risks to model validity and usability in emergency settings.
A minimum reporting set is essential for effective use of large language models in time-critical emergency care.

Abstract

Research on large language models (LLMs) in emergency medicine (EM) has expanded rapidly, yet core threats to validity and reproducibility remain under-addressed. We critically synthesized the methods, reporting quality, and clinical relevance of LLM-focused work in emergency care published between January 2020 and April 2025. We conducted a PubMed search and verified journal indexing in the Web of Science (WoS) to restrict screening to EM-relevant studies published in journals indexed under the WoS ‘EM’ category, excluding editorials that lacked primary or secondary analysis. Two reviewers independently coded protocol availability; prompt transparency; data realism; reference standards; calibration and decision-curve reporting; external validation; and expert benchmarking, resolving discrepancies by consensus. Ninety-one studies met the inclusion criteria; sixty were original investigations. Prompt disclosure was complete in roughly one-third of studies, and real-world clinical data were used less often than synthetic or examination-style vignettes. Calibration, decision-curve analysis, and demonstrations of incremental value over parsimonious clinical baselines were infrequently reported. Expert benchmarking appeared inconsistently across journal strata, and “near-expert” claims often relied on proxy tasks with limited ecological validity. External validation was uncommon, and model/version identifiers were frequently incomplete, undermining reproducibility. Overall, the current LLM literature within this core EM journal corpus is method-lean and report-light: high-level accuracy claims rarely translate into decision-useful evidence. A minimum reporting set (transparent prompts, code, and versioning; calibration; decision-curve analysis; and expert benchmarking on real data) is needed; absent these elements, deployment in time-critical emergency care remains premature.
