Research on large language models (LLMs) in emergency medicine (EM) has expanded rapidly, yet core threats to validity and reproducibility remain under-addressed. We critically synthesized the methods, reporting quality, and clinical relevance of LLM-focused work in emergency care published between January 2020 and April 2025. We searched PubMed and verified journal indexing in the Web of Science (WoS), restricting screening to EM-relevant studies published in journals indexed under the WoS ‘EM’ category and excluding editorials that lacked primary or secondary analysis. Two reviewers independently coded protocol availability; prompt transparency; data realism; reference standards; calibration and decision-curve reporting; external validation; and expert benchmarking, resolving discrepancies by consensus. Ninety-one studies met the inclusion criteria; sixty were original investigations. Prompt disclosure was complete in roughly one-third of studies, and real-world clinical data were used less often than synthetic or examination-style vignettes. Calibration, decision-curve analysis, and demonstrations of incremental value over parsimonious clinical baselines were infrequently reported. Expert benchmarking appeared inconsistently across journal strata, and “near-expert” claims often relied on proxy tasks with limited ecological validity. External validation was uncommon, and model/version identifiers were frequently incomplete, undermining reproducibility. Overall, the current LLM literature within this core EM journal corpus is method-lean and report-light: high-level accuracy claims rarely translate into decision-useful evidence. A minimum reporting set is needed: transparent prompts, code, and versioning; calibration; decision-curve analysis; and expert benchmarking on real data. Absent these elements, deployment in time-critical emergency care remains premature.
Aykut et al. studied this question.
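As an illustration of what the calibration and decision-curve items in the proposed reporting set involve, the following is a minimal sketch and is not taken from the reviewed studies. It assumes binary outcomes and model-predicted probabilities. The function names (`net_benefit`, `net_benefit_treat_all`, `brier_score`) and the toy data are hypothetical. Net benefit follows the standard decision-curve definition, TP/n - (FP/n) * pt/(1 - pt) at threshold probability pt, and the Brier score is used here as one simple summary related to calibration.

```python
from typing import Sequence


def net_benefit(y_true: Sequence[int], y_prob: Sequence[float], threshold: float) -> float:
    """Net benefit of acting on patients whose predicted risk is >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true: Sequence[int], threshold: float) -> float:
    """Net benefit of the 'treat everyone' reference strategy."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


if __name__ == "__main__":
    # Toy data, invented purely for illustration (not from any study).
    outcomes = [1, 0, 1, 0, 0, 1, 0, 0]
    risks = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.6, 0.05]
    for pt in (0.1, 0.3, 0.5):
        print(
            f"pt={pt:.1f}  model={net_benefit(outcomes, risks, pt):.3f}  "
            f"treat-all={net_benefit_treat_all(outcomes, pt):.3f}  treat-none=0.000"
        )
    print(f"Brier score: {brier_score(outcomes, risks):.4f}")
```

A decision curve plots these net-benefit values across a clinically plausible range of thresholds. A model adds decision value only where its curve lies above both the treat-all and treat-none (net benefit 0) strategies.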