Background: Large language models (LLMs) are increasingly applied in medical education, yet their reliability in specialized, high-stakes assessments such as the Chinese Health Professional and Technical Examination remains unclear. DeepSeek-R1, a recently released reasoning-enhanced LLM, has shown promising performance, but empirical evidence within nursing examination contexts is limited.

Objective: To compare the performance of DeepSeek-R1 and the GPT-4o API on the Chinese Health Professional and Technical Examination (Intermediate Nursing), focusing on accuracy, response consistency, and consistent accuracy.

Methods: Four hundred multiple-choice questions from official practice examinations were categorized into four competency units and two question types (A/B). Both models were evaluated on overall accuracy, consistency (agreement across repeated responses), and consistent accuracy (proportion of responses that were both consistent and correct). Stratified analyses were performed across units, question types, and disciplines. Chi-square tests were used for statistical comparison, and Holm–Bonferroni correction was applied for multiple comparisons.

Results: DeepSeek-R1 demonstrated significantly higher overall accuracy than the GPT-4o API (88.5% vs. 67.9%, P < 0.001). The GPT-4o API showed higher response consistency (96.5% vs. 88.5%) but lower consistent accuracy (66.7% vs. 84.0%). After multiple-comparison correction, significant differences in consistent accuracy remained in basic knowledge, professional knowledge, professional practice ability, and Type A questions, as well as in the surgical and gynecological nursing disciplines, while other domains showed no statistically significant differences.

Conclusion: DeepSeek-R1 outperformed the GPT-4o API across multiple dimensions of nursing competency assessment, particularly in overall accuracy and consistent accuracy. The GPT-4o API exhibited high response stability but a tendency toward systematic errors, underscoring the need for careful interpretation of model outputs. Further research is needed to evaluate LLM performance using open-ended clinical reasoning tasks and real-world assessment data to support safe and effective educational integration.
Li et al. studied this question.
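The evaluation metrics and statistical procedure named in the Methods can be illustrated with a minimal Python sketch. It is not the authors' code, and several details are assumptions because the abstract does not specify them: a question is treated as "consistent" when all repeated responses are identical, overall accuracy is computed over every individual response, the chi-square test is an uncorrected Pearson test on a 2x2 table, and all function names and toy data are hypothetical.

```python
from math import erfc, sqrt


def score_model(responses, answer_key):
    """Compute accuracy, consistency, and consistent accuracy.

    responses: one list of repeated answers per question (assumed layout).
    answer_key: the correct option for each question.
    """
    total = correct = consistent = consistent_correct = 0
    for repeats, key in zip(responses, answer_key):
        total += len(repeats)
        correct += sum(r == key for r in repeats)
        # Assumption: "consistent" means every repeat gave the same option.
        is_consistent = len(set(repeats)) == 1
        consistent += is_consistent
        consistent_correct += is_consistent and repeats[0] == key
    n = len(answer_key)
    return {
        "accuracy": correct / total,
        "consistency": consistent / n,
        "consistent_accuracy": consistent_correct / n,
    }


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With 1 degree of freedom the upper-tail probability is erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))


def holm_bonferroni(p_values, alpha=0.05):
    """Return a reject/keep decision per hypothesis, in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Step-down rule: the k-th smallest p-value is compared with alpha / (m - k + 1).
        if p_values[idx] > alpha / (m - rank):
            break  # once one test fails, all larger p-values are retained
        reject[idx] = True
    return reject


if __name__ == "__main__":
    # Toy data only; these are not values from the study.
    toy_responses = [["A", "A", "A"], ["B", "C", "B"], ["D", "D", "D"]]
    toy_key = ["A", "B", "C"]
    print(score_model(toy_responses, toy_key))
    print(chi_square_2x2(10, 20, 20, 10))
    print(holm_bonferroni([0.01, 0.04, 0.03, 0.005]))
```

In this sketch, a model that repeats the same wrong option on every attempt raises consistency without raising consistent accuracy, which is the pattern the Results describe for the GPT-4o API.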