What question did this study set out to answer?

The study evaluates how question type and language (Chinese versus English) affect the accuracy of large language models on questions from the Chinese national medical licensing examination.

April 10, 2026 | Open Access

Performance benchmarking of LLMs on Chinese national medical licensing education: Cross-lingual and question-type effects

Key Points

The cross-sectional study used 396 educational questions from the Chinese national medical licensing examination.
The questions comprise 198 English–Chinese pairs, which allow direct cross-lingual comparison.
Six LLMs were prompted for answers, and accuracy was computed for three question types: basic knowledge (Type A), case analysis (Type B), and integrative judgment (Type C).
Doubao-1.5-pro achieved the highest accuracy (92.0% ± 1.3%) and ChatGPT-4o the lowest (82.8% ± 3.7%).
Question type had a significant main effect (P = 0.0038): accuracy on Type A questions exceeded that on Types B and C.
For Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3, accuracy differences between the Chinese and English versions remained below 5%.
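
The cross-lingual stability criterion in the last key point can be illustrated with a minimal Python sketch. It assumes graded records of the form (model, pair ID, language, correct?) and reads "below 5%" as an absolute gap in percentage points; both the record layout and that reading are assumptions, and the toy values are hypothetical, not data from the study.

```python
from collections import defaultdict

# Hypothetical graded records: (model, pair_id, language, is_correct).
# Illustrative toy values only; NOT data from the study.
records = [
    ("model_x", 1, "zh", True),
    ("model_x", 1, "en", True),
    ("model_x", 2, "zh", True),
    ("model_x", 2, "en", False),
    ("model_x", 3, "zh", False),
    ("model_x", 3, "en", False),
    ("model_x", 4, "zh", True),
    ("model_x", 4, "en", True),
]


def accuracy_by_language(rows):
    """Return {(model, language): accuracy in percent}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for model, _pair_id, language, is_correct in rows:
        total[(model, language)] += 1
        correct[(model, language)] += int(is_correct)
    return {key: 100.0 * correct[key] / total[key] for key in total}


def cross_lingual_gap(rows, model):
    """Absolute Chinese-vs-English accuracy gap, in percentage points."""
    acc = accuracy_by_language(rows)
    return abs(acc[(model, "zh")] - acc[(model, "en")])


gap = cross_lingual_gap(records, "model_x")
# Assumed reading of the "below 5%" criterion: gap under 5 percentage points.
is_stable = gap < 5.0
```

Because every question exists in both languages, each language is scored on the same set of items, so the gap reflects language rather than question difficulty.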

Abstract

Background: How language and question type affect the accuracy of large language models (LLMs) on Chinese national medical licensing examination questions remains insufficiently explored.

Methods: In this cross-sectional study (May 13–20, 2025), 396 educational questions (198 English–Chinese pairs) were extracted from the Chinese national medical licensing examination. ChatGPT-4o, ChatGPT-o3, Gemini-2.5-pro, Deepseek-V3, Deepseek-R1, and Doubao-1.5-pro were prompted to provide answers. Responses were compared against reference answers, and accuracy was computed for three question types: basic knowledge (Type A), case analysis (Type B), and integrative judgment (Type C).

Results: Across all question types and languages, Doubao-1.5-pro achieved the highest accuracy at 92.0% ± 1.3%, whereas ChatGPT-4o had the lowest at 82.8% ± 3.7%. There was a significant main effect of question type (P = 0.0038) but no main effect of language (P = 0.56). Post hoc tests confirmed that Type A performance exceeded that of Types B and C (P < 0.01), while Types B and C did not differ. Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 showed notable cross-lingual stability, with accuracy differences between the Chinese and English versions remaining below 5%.

Conclusion: Question type was a key factor affecting LLM performance on Chinese medical licensing exam questions, whereas language had no significant impact. Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated particularly strong cross-lingual consistency. These findings point to the potential value of specialized LLMs for enhancing medical education in China.
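
The scoring step described in the Methods (comparing responses against reference answers and computing accuracy per question type) might look roughly like the following Python sketch. The tuple layout, the single-letter answer format, and the case-insensitive matching rule are assumptions made for illustration; the toy rows are hypothetical and do not come from the study.

```python
from collections import Counter

# Hypothetical rows: (question_type, model_answer, reference_answer).
# Illustrative toy values only; NOT data from the study.
graded = [
    ("A", "B", "B"),
    ("A", "c", "C"),
    ("B", "A", "D"),
    ("B", "E", "E"),
    ("C", "A", "A"),
    ("C", "B", "C"),
]


def accuracy_by_type(rows):
    """Return {question_type: fraction of answers matching the reference}."""
    total = Counter()
    correct = Counter()
    for question_type, answer, reference in rows:
        total[question_type] += 1
        # Assumed matching rule: exact option letter, ignoring case and spaces.
        if answer.strip().upper() == reference.strip().upper():
            correct[question_type] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}


per_type = accuracy_by_type(graded)
```

Per-type accuracies computed this way for each model and language would be the inputs to a comparison of question types such as the one reported in the Results.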

Cite This Study

Tang et al. studied this question.

synapsesocial.com/papers/69d895ea6c1944d70ce07153 https://doi.org/10.1371/journal.pone.0346518
