What question did this study set out to answer?

This study evaluates DeepSeek-R1's performance compared to OpenAI models in medical assessments like the USMLE.

June 18, 2026 | Open Access

A comparative benchmark of DeepSeek-R1 on the USMLE: surpassing human and AI performance averages

Key Points

Evaluated five models (DeepSeek-R1, DeepSeek-V3, and three OpenAI models) on 321 text-based USMLE-style questions.
Calculated accuracy rates and performed statistical comparisons using Chi-Square tests with Bonferroni correction.
DeepSeek-R1 achieved the highest overall accuracy, 92.5% (95% CI 89.1%–94.9%), significantly higher than the 78.8% reported for the OpenAI models (p < 0.0001).
DeepSeek-R1 surpassed the reported average human examinee performance on all USMLE steps, and the consensus between DeepSeek-R1 and OpenAI o1 pro yielded 94.9% accuracy.
In discordant cases, DeepSeek-R1 reached 82.8% accuracy compared to 14.1% to 28.1% for the OpenAI models (p < 0.0001).
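The statistical comparison named in the key points (Chi-Square tests with Bonferroni correction) can be sketched as follows. This is a minimal illustration, not the authors' analysis code. The counts are assumptions obtained by rounding the reported percentages against 321 questions (297 correct for 92.5%, 253 correct for 78.8%), the test is run without a continuity correction, and the number of comparisons (4) is a hypothetical choice, since the summary does not state any of these details.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table
    of correct/incorrect answers for two models."""
    observed = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    n = total_a + total_b
    row_totals = [total_a, total_b]
    col_totals = [
        observed[0][0] + observed[1][0],
        observed[0][1] + observed[1][1],
    ]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom, the chi-square survival function
    # equals erfc(sqrt(x / 2)), so no external library is needed.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def bonferroni(p_value, n_comparisons):
    """Bonferroni adjustment: multiply by the number of tests, cap at 1."""
    return min(1.0, p_value * n_comparisons)


# Assumed counts, inferred from the reported percentages (not given in the source).
chi2, p = chi_square_2x2(297, 321, 253, 321)
adjusted = bonferroni(p, n_comparisons=4)  # 4 comparisons is an assumption
print(f"chi2 = {chi2:.2f}, raw p = {p:.2e}, Bonferroni-adjusted p = {adjusted:.2e}")
```

Under these assumed counts the adjusted p-value stays below 0.0001, which is consistent with the significance level reported above.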

Abstract

OBJECTIVE: The rapid advancement of Large Language Models (LLMs) has generated interest in their application to medical education, particularly for high-stakes assessments like the USMLE. This study aims to evaluate the performance of DeepSeek-R1, a state-of-the-art LLM developed in China, compared to OpenAI models, to assess its feasibility for medical education and assessment.

METHODS: The authors evaluated the performance of five models, including DeepSeek-R1, DeepSeek-V3, and three OpenAI models (GPT-4 Omni, OpenAI o3-mini, OpenAI o1 pro), on 321 text-based USMLE-style questions. Accuracy rates were calculated, and statistical comparisons were performed using Chi-Square tests with Bonferroni correction.

RESULTS: DeepSeek-R1 achieved the highest overall accuracy of 92.5% (95% CI 89.1%–94.9%), significantly outperforming the OpenAI models (all 78.8%, p < 0.0001). DeepSeek-R1 also surpassed the reported average human examinee performance across all USMLE steps. The inter-model consensus between DeepSeek-R1 and OpenAI o1 pro yielded 94.9% accuracy, indicating high reliability for straightforward queries. Furthermore, in discordant cases, DeepSeek-R1 demonstrated superior capability with 82.8% accuracy compared to 14.1%–28.1% for the OpenAI models (p < 0.0001).

CONCLUSION: DeepSeek-R1 emerges as a compelling candidate in the AI-driven healthcare landscape, demonstrating superior accuracy and reasoning capabilities. However, its current limitation in multimodal data processing underscores the need for further innovation. These findings provide valuable insights for educators and policymakers regarding the integration of non-Western LLMs into medical assessment.
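The 95% confidence interval reported for DeepSeek-R1's accuracy can be approximated with a short calculation. The abstract does not say which interval method was used, so the Wilson score interval below is an assumption, as is the count of correct answers (297, obtained by rounding 92.5% of 321). The function name `wilson_interval` is illustrative only.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion.
    z defaults to the two-sided 95% normal quantile."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    )
    return center - half_width, center + half_width


# Assumed count: 92.5% of 321 questions, rounded to the nearest integer.
correct = round(0.925 * 321)
low, high = wilson_interval(correct, 321)
print(f"{correct}/321 correct, 95% CI {low:.1%} to {high:.1%}")
```

With these assumptions the interval comes out at roughly 89.1% to 94.9%, in line with the reported range, although that agreement alone does not confirm the authors' method.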
