Large language models (LLMs) are attracting growing interest in emergency medicine, yet their performance across complex clinical domains remains underexplored. This study evaluated four LLMs (ChatGPT-4o, ChatGPT-o3mini, Gemini 2.0-Pro, and DeepSeek-R1) on 11 representative emergency simulation cases spanning six core domains: triage, assessment and diagnosis, treatment decision-making, post-treatment management and follow-up, psychosocial support, and prognosis and rehabilitation. A panel of 20 emergency physicians, each with more than eight years of clinical experience, independently rated each model's responses to domain-specific clinical queries for quality, relevance, and applicability. Misleading or inappropriate outputs were re-prompted to assess the models' capacity for self-correction. Inter-rater reliability was acceptable for every model (intraclass correlation coefficient, ICC > 0.60), and the overall ICC of 0.897 indicated high consistency among raters. ChatGPT-4o and Gemini 2.0-Pro outperformed the other two models in overall quality and relevance. Gemini 2.0-Pro achieved the highest relevance and applicability scores in the psychosocial support domain (P < 0.001). DeepSeek-R1 also showed strong applicability in psychosocial tasks, surpassing ChatGPT-4o and ChatGPT-o3mini (P < 0.05). Triage was the most error-prone domain across all models. After re-prompting, ChatGPT-o3mini, Gemini 2.0-Pro, and DeepSeek-R1 showed significant improvements in applicability, particularly in triage, psychosocial support, and follow-up management. These findings highlight the potential of LLMs, particularly ChatGPT-4o and Gemini 2.0-Pro, to enhance emergency decision-making. However, the variability in task-specific performance points to the need for further domain-specific refinement before clinical implementation.
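The abstract reports inter-rater reliability as an ICC but does not say which ICC form was used. As a rough illustration only, the sketch below computes two common variants, the two-way random-effects, absolute-agreement ICC for a single rater and for the average of all raters, from a matrix of scores (rows are rated responses, columns are raters). The function name `icc_two_way_random` and the toy ratings are assumptions made for this example; they are not the study's method or data.

```python
from typing import List, Tuple


def icc_two_way_random(ratings: List[List[float]]) -> Tuple[float, float]:
    """Return (single-rater ICC, average-rater ICC) for an n x k rating matrix.

    Assumes a two-way random-effects model with absolute agreement.
    This is one common choice; the source does not state which form it used.
    """
    n = len(ratings)        # number of rated items (rows)
    k = len(ratings[0])     # number of raters (columns)
    grand = sum(sum(row) for row in ratings) / (n * k)

    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(ratings[i][j] for i in range(n)) / n for j in range(k)]

    # Sums of squares from a two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average


if __name__ == "__main__":
    # Toy data: 6 rated responses scored by 4 hypothetical raters.
    toy = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    single, average = icc_two_way_random(toy)
    print(f"single-rater ICC: {single:.2f}, average-rater ICC: {average:.2f}")
```

The average-rater form is always at least as high as the single-rater form when raters agree at all, so a reported ICC is only interpretable once the form is known; with a 20-rater panel the two can differ substantially.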
Linfang Deng
Wei Liu
Yuwei Sun
BMC Emergency Medicine
China Medical University
First Hospital of China Medical University
Deng et al. studied this question.
www.synapsesocial.com/papers/69abc1235af8044f7a4e9b2a — DOI: https://doi.org/10.1186/s12873-026-01511-0