OBJECTIVE: The rapid advancement of Large Language Models (LLMs) has generated interest in their application to medical education, particularly for high-stakes assessments such as the USMLE. This study evaluates the performance of DeepSeek-R1, a state-of-the-art LLM developed in China, against OpenAI models to assess its feasibility for medical education and assessment.

METHODS: The authors evaluated five models, two from DeepSeek (DeepSeek-R1, DeepSeek-V3) and three from OpenAI (GPT-4 Omni, OpenAI o3-mini, OpenAI o1 pro), on 321 text-based USMLE-style questions. Accuracy rates were calculated, and statistical comparisons were performed using chi-square tests with Bonferroni correction.

RESULTS: DeepSeek-R1 achieved the highest overall accuracy of 92.5% (95% CI 89.1%–94.9%), significantly outperforming the OpenAI models (all 78.8%, p < 0.0001). DeepSeek-R1 also surpassed the reported average human examinee performance across all USMLE steps. The inter-model consensus between DeepSeek-R1 and OpenAI o1 pro yielded 94.9% accuracy, indicating high reliability for straightforward queries. Furthermore, in discordant cases, DeepSeek-R1 was correct far more often, with 82.8% accuracy compared to 14.1%–28.1% for the OpenAI models (p < 0.0001).

CONCLUSION: DeepSeek-R1 emerges as a compelling candidate in the AI-driven healthcare landscape, demonstrating superior accuracy and reasoning capabilities. However, its current limitation in multimodal data processing underscores the need for further innovation. These findings provide valuable insights for educators and policymakers regarding the integration of non-Western LLMs into medical assessment.
Zhou et al. studied this question.
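The statistics named in the METHODS and RESULTS sections can be illustrated with a minimal sketch. The abstract does not report raw counts or say which confidence-interval method was used, so everything below is an assumption: 297 correct answers out of 321 is inferred from the reported 92.5%, the Wilson score interval is one common choice, the comparator count of 253 correct (about 78.8%) is hypothetical, and the chi-square test is shown without continuity correction. The number of Bonferroni comparisons (four, DeepSeek-R1 against each other model) is likewise assumed.

```python
import math


def wilson_interval(correct, total, z=1.959964):
    """Wilson score interval for a binomial proportion (95% by default)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and its p-value. With one degree of freedom the
    p-value equals erfc(sqrt(chi2 / 2)).
    """
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2, math.erfc(math.sqrt(chi2 / 2))


def bonferroni(p_value, n_comparisons):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_comparisons)


TOTAL = 321          # number of questions, from the abstract
R1_CORRECT = 297     # assumption: inferred from the reported 92.5%
OTHER_CORRECT = 253  # hypothetical comparator at roughly 78.8%
N_COMPARISONS = 4    # assumption: DeepSeek-R1 versus each of the other four models

low, high = wilson_interval(R1_CORRECT, TOTAL)
chi2, p_raw = chi_square_2x2(
    R1_CORRECT, TOTAL - R1_CORRECT,
    OTHER_CORRECT, TOTAL - OTHER_CORRECT,
)
p_adj = bonferroni(p_raw, N_COMPARISONS)

print(f"DeepSeek-R1 accuracy 95% CI: {low:.1%} to {high:.1%}")
print(f"chi-square = {chi2:.2f}, Bonferroni-adjusted p = {p_adj:.2e}")
```

Under these assumed counts, the Wilson interval comes out at 89.1% to 94.9%, which agrees with the interval reported in the abstract, and the adjusted p-value falls below 0.0001. That agreement is consistent with the reported figures but does not confirm the authors' exact procedure.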