What question did this study set out to answer?

To evaluate and compare the performance of DeepSeek-R1 and GPT-4o API on the Chinese Health Professional and Technical Examination.

January 24, 2026 · Open Access

Performance of DeepSeek and ChatGPT on the Chinese Health Professional and Technical Examination: A comparative study

Key Points

The study evaluated and compared the performance of DeepSeek-R1 and the GPT-4o API on the Chinese Health Professional and Technical Examination (Intermediate Nursing).
It used 400 official multiple-choice practice questions, categorized into four competency units and two question types (A/B).
It assessed overall accuracy, response consistency (agreement across repeated responses), and consistent accuracy (responses that were both consistent and correct) for each model.
Stratified analyses across units, question types, and disciplines were compared using chi-square tests with Holm–Bonferroni correction for multiple comparisons.
DeepSeek-R1 achieved 88.5% accuracy compared to GPT-4o API's 67.9% (P < 0.001).
GPT-4o API showed 96.5% response consistency, while DeepSeek-R1 had 88.5%.
DeepSeek-R1 had higher overall consistent accuracy (84.0%) than the GPT-4o API (66.7%); after multiple-comparison correction, the difference remained significant in several nursing domains.
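
The three metrics named in the key points can be illustrated with a minimal sketch. This is not the authors' scoring code: the function name `score_model`, the data layout, and the toy data are hypothetical, and it assumes that overall accuracy is computed over all individual responses and that a question counts as "consistent" only when every repeated response is identical.

```python
from typing import Dict, List


def score_model(
    responses: Dict[str, List[str]], answer_key: Dict[str, str]
) -> Dict[str, float]:
    """Score repeated multiple-choice responses (illustrative assumptions only).

    responses maps a question id to the answers given on each repeated run.
    """
    n_questions = len(responses)
    total_runs = sum(len(runs) for runs in responses.values())

    # Assumed definition: accuracy over every individual response.
    correct_runs = sum(
        answer == answer_key[qid]
        for qid, runs in responses.items()
        for answer in runs
    )

    # Assumed definition: consistent means all repeated answers agree.
    consistent = [qid for qid, runs in responses.items() if len(set(runs)) == 1]

    # Consistent accuracy: consistent AND matching the answer key.
    consistent_correct = [
        qid for qid in consistent if responses[qid][0] == answer_key[qid]
    ]

    return {
        "accuracy": correct_runs / total_runs,
        "consistency": len(consistent) / n_questions,
        "consistent_accuracy": len(consistent_correct) / n_questions,
    }


# Hypothetical toy data: four questions, two runs each.
answer_key = {"q1": "A", "q2": "B", "q3": "C", "q4": "D"}
responses = {
    "q1": ["A", "A"],  # consistent and correct
    "q2": ["B", "C"],  # inconsistent
    "q3": ["D", "D"],  # consistent but wrong
    "q4": ["D", "D"],  # consistent and correct
}
scores = score_model(responses, answer_key)
print(scores)
# {'accuracy': 0.625, 'consistency': 0.75, 'consistent_accuracy': 0.5}
```

Question `q3` in the toy data shows why consistency and consistent accuracy can diverge: a model that repeats the same wrong answer scores as consistent without being correct.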

Abstract

Background: Large language models (LLMs) are increasingly applied in medical education, yet their reliability in specialized, high-stakes assessments such as the Chinese Health Professional and Technical Examination remains unclear. DeepSeek-R1, a recently released reasoning-enhanced LLM, has shown promising performance, but empirical evidence within nursing examination contexts is limited.

Objective: To compare the performance of DeepSeek-R1 and the GPT-4o API on the Chinese Health Professional and Technical Examination (Intermediate Nursing), focusing on accuracy, response consistency, and consistent accuracy.

Methods: Four hundred official practice examination multiple-choice questions were categorized into four competency units and two question types (A/B). Both models were evaluated using overall accuracy, consistency (agreement across repeated responses), and consistent accuracy (proportion of responses that were both consistent and correct). Stratified analyses were performed across units, question types, and disciplines. Chi-square tests were used for statistical comparison, and Holm–Bonferroni correction was applied for multiple comparisons.

Results: DeepSeek-R1 demonstrated significantly higher overall accuracy than the GPT-4o API (88.5% vs. 67.9%, P < 0.001). The GPT-4o API showed higher response consistency (96.5% vs. 88.5%) but lower consistent accuracy (66.7% vs. 84.0%). After multiple-comparison correction, significant differences in consistent accuracy remained in basic knowledge, professional knowledge, professional practice ability, and Type A questions, as well as in the surgical and gynecological nursing disciplines, while other domains showed no statistically significant differences.

Conclusion: DeepSeek-R1 outperformed the GPT-4o API across multiple dimensions of nursing competency assessment, particularly in overall accuracy and consistent accuracy. The GPT-4o API exhibited high response stability but a tendency toward systematic errors, underscoring the need for careful interpretation of model outputs. Further research is needed to evaluate LLM performance using open-ended clinical reasoning tasks and real-world assessment data to support safe and effective educational integration.
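
The statistical procedure named in the abstract (chi-square tests followed by Holm–Bonferroni correction) can be sketched as follows. This is an illustration under stated assumptions, not the authors' analysis: the function names `chi_square_2x2` and `holm_adjust` are hypothetical, the test is the Pearson chi-square on a 2×2 table without continuity correction (the abstract does not say which variant was used), and the counts and p-values in the example are made up rather than taken from the study.

```python
import math
from typing import List, Tuple


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square for the 2x2 table [[a, b], [c, d]].

    Rows could be the two models and columns correct / incorrect counts.
    Returns (statistic, p_value) with 1 degree of freedom and no
    continuity correction.
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    stat = n * (a * d - b * c) ** 2 / denom
    # For 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def holm_adjust(p_values: List[float]) -> List[float]:
    """Holm-Bonferroni adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (k = rank + 1) is scaled by m - k + 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity so adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical counts: model A 45/50 correct, model B 30/50 correct.
stat, p = chi_square_2x2(45, 5, 30, 20)
print(f"chi2 = {stat:.2f}, p = {p:.4f}")

# Hypothetical raw p-values from three stratified comparisons.
adjusted = holm_adjust([0.01, 0.04, 0.03])
significant = [p_adj <= 0.05 for p_adj in adjusted]
print(adjusted, significant)
```

Holm's step-down procedure is uniformly at least as powerful as the plain Bonferroni correction while still controlling the family-wise error rate, which makes it a common choice when many strata are compared.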
