December 5, 2025 | Open Access

Benchmarking large language models on the United States medical licensing examination for clinical reasoning and medical licensing scenarios

Key Points

Model performance evaluated across four types of USMLE questions, indicating high accuracy.
DeepSeek reached 93% accuracy on Step 2 CK, and DeepSeek and ChatGPT consistently outperformed Grok and Qwen.
Error analysis showed that failures shared by all models were rare and concentrated in multimodal and quantitative reasoning tasks.
Results highlight the promise of LLMs in medical contexts, pointing to the need for human oversight and further benchmarking.

Abstract

Artificial intelligence (AI) is transforming healthcare by assisting with intricate clinical reasoning and diagnosis. Recent research demonstrates that large language models (LLMs), such as ChatGPT and DeepSeek, possess considerable potential in medical comprehension. This study evaluates the clinical reasoning capabilities of four advanced LLMs (ChatGPT, DeepSeek, Grok, and Qwen), using the United States Medical Licensing Examination (USMLE) as a standard benchmark. We assess 376 publicly accessible USMLE sample exam questions (Step 1, Step 2 CK, Step 3) from the most recent booklet, released in July 2023. We analyze model performance across four question categories (text-only, text with image, text with mathematical reasoning, and integrated text-image-mathematical reasoning) and measure model accuracy on each of the three USMLE steps. Our findings show that DeepSeek and ChatGPT consistently outperform Grok and Qwen, with DeepSeek reaching 93% on Step 2 CK. Error analysis revealed that universal failures, meaning questions that all four models answered incorrectly, were rare (1.60%) and concentrated in multimodal and quantitative reasoning tasks, suggesting both ensemble potential and shared blind spots. Compared to the baseline ChatGPT-3.5 Turbo, newer models demonstrate substantial gains, though possible training-data exposure to USMLE content limits generalizability. Despite encouraging accuracy, models exhibited overconfidence and hallucinations, underscoring the need for human oversight. Limitations include reliance on sample questions, the small number of multimodal items, and the lack of real-world datasets. Future work should expand benchmarks, integrate physician feedback, and improve reproducibility through shared prompts and configurations. Overall, these results highlight both the promise and the limitations of LLMs in medical testing: strong accuracy and complementarity, but persistent risks requiring innovation, benchmarking, and clinical oversight.
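The two headline metrics in the abstract, per-model accuracy and the universal failure rate, can be illustrated with a minimal Python sketch. It assumes each model's graded answers are stored as a list of booleans in the same question order. The function names, model labels, and toy data are hypothetical and are not taken from the study. As a sanity check on the reported figure, a universal failure rate of 1.60% over 376 questions is consistent with 6 questions missed by every model (6/376 ≈ 1.60%).

```python
from typing import Dict, List


def accuracy(results: List[bool]) -> float:
    """Fraction of questions a single model answered correctly."""
    return sum(results) / len(results)


def universal_failure_rate(per_model: Dict[str, List[bool]]) -> float:
    """Fraction of questions that every model answered incorrectly.

    Assumes all lists cover the same questions in the same order.
    """
    rows = list(zip(*per_model.values()))  # one tuple of outcomes per question
    missed_by_all = sum(1 for row in rows if not any(row))
    return missed_by_all / len(rows)


# Hypothetical toy data (5 questions, 3 models); NOT the study's results.
toy = {
    "model_a": [True, True, False, True, False],
    "model_b": [True, False, False, True, True],
    "model_c": [False, True, False, True, True],
}

per_model_accuracy = {name: accuracy(r) for name, r in toy.items()}
toy_universal = universal_failure_rate(toy)

# Arithmetic check on the reported rate: 6 of 376 questions.
reported_rate_percent = round(6 / 376 * 100, 2)
```

In this toy data only the third question is missed by every model, so the universal failure rate is one in five. Questions missed by some models but not others are the ones where an ensemble could, in principle, recover the correct answer.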
