What does this research mean for the field?

ChatGPT-4o and Gemini 2.0-Pro generated higher-quality and more relevant responses to emergency medicine scenarios than the other two models evaluated (ChatGPT-o3mini and DeepSeek-R1). Novelty: confirmatory. Consensus alignment: neutral.

What question did this study set out to answer?

The aim is to evaluate the performance of different large language models in complex emergency medicine scenarios.

March 7, 2026 | Open Access

Evaluation of large language models in emergency medicine scenarios: a comparative analysis of ChatGPT-4o, ChatGPT-o3mini, Gemini 2.0-pro, and DeepSeek-R1

Linfang Deng (China Medical University), Wei Liu (Anhui Medical University), Yuwei Sun (China Medical University)

Key Points

The aim is to evaluate the performance of different large language models in complex emergency medicine scenarios.
Four LLMs were evaluated: ChatGPT-4o, ChatGPT-o3mini, Gemini 2.0-Pro, and DeepSeek-R1.
Used 11 emergency simulation cases covering six core clinical domains.
Responses were evaluated by a panel of 20 experienced emergency physicians.
Evaluation criteria included response quality, relevance, and applicability.
Re-prompting was used to assess self-correction capabilities.
ChatGPT-4o and Gemini 2.0-Pro outperformed the other models in overall quality and relevance.
Gemini 2.0-Pro achieved the highest relevance and applicability scores in psychosocial support (P < 0.001).
DeepSeek-R1 surpassed both ChatGPT models in applicability for psychosocial tasks (P < 0.05).
Triage was the most error-prone domain across all models.
After re-prompting, ChatGPT-o3mini, Gemini 2.0-Pro, and DeepSeek-R1 showed significant improvements in applicability, particularly in triage, psychosocial support, and follow-up management.

Abstract

The application of large language models (LLMs) in emergency medicine has attracted increasing interest, yet their performance across complex clinical domains remains underexplored. This study evaluated four LLMs (ChatGPT-4o, ChatGPT-o3mini, Gemini 2.0-Pro, and DeepSeek-R1) using 11 representative emergency simulation cases spanning six core domains: triage, assessment and diagnosis, treatment decision-making, post-treatment management and follow-up, psychosocial support, and prognosis and rehabilitation. Each model's responses to domain-specific clinical queries were independently evaluated by a panel of 20 emergency physicians, all with more than eight years of clinical experience. Evaluations were based on response quality, relevance, and applicability. Misleading or inappropriate outputs were re-prompted to assess the models' self-correction capabilities. Inter-rater reliability was acceptable for all models (ICC > 0.60), and the overall ICC of 0.897 indicated high consistency. ChatGPT-4o and Gemini 2.0-Pro outperformed the others in overall quality and relevance. Notably, Gemini 2.0-Pro achieved the highest relevance and applicability scores in the psychosocial support domain (P < 0.001). DeepSeek-R1 also showed strong applicability in psychosocial tasks, surpassing ChatGPT-4o and ChatGPT-o3mini (P < 0.05). Triage emerged as the most error-prone domain across all models. Following re-prompting, ChatGPT-o3mini, Gemini 2.0-Pro, and DeepSeek-R1 showed significant improvements in applicability, particularly in triage, psychosocial support, and follow-up management. These findings highlight the potential of LLMs, particularly ChatGPT-4o and Gemini 2.0-Pro, to enhance emergency decision-making. However, the variability in task-specific performance points to the need for further domain-specific refinement before clinical implementation.
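
The abstract reports inter-rater reliability as an intraclass correlation coefficient (ICC) but does not say which ICC form was used. As a rough illustration only, the sketch below assumes a two-way random-effects, absolute-agreement model (the Shrout and Fleiss ICC(2,1) and ICC(2,k) forms). The function name `icc_two_way_random` and the small ratings matrix are hypothetical and are not taken from the study.

```python
def icc_two_way_random(ratings):
    """Return (single-rater ICC, average-rater ICC) for a complete ratings matrix.

    ratings: list of rows, one row per rated response, one score per rater.
    Assumes a two-way random-effects, absolute-agreement model; the paper
    may have used a different ICC form.
    """
    n = len(ratings)       # number of rated responses
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    bms = ss_rows / (n - 1)               # between-responses mean square
    jms = ss_cols / (k - 1)               # between-raters mean square
    ems = ss_err / ((n - 1) * (k - 1))    # residual mean square

    icc_single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    icc_average = (bms - ems) / (bms + (jms - ems) / n)
    return icc_single, icc_average


# Illustrative scores only (3 responses rated by 3 raters), not study data.
example = [[4, 5, 4], [2, 3, 2], [5, 5, 4]]
single, average = icc_two_way_random(example)
print(f"ICC(2,1) = {single:.3f}, ICC(2,k) = {average:.3f}")
```

With 20 raters per response, the average-rater form is typically much higher than the single-rater form, so the form chosen matters when interpreting a value such as 0.897.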
