Background: Cross-lingual and question-type variation in the accuracy of large language models (LLMs) on the Chinese national medical licensing examination remains insufficiently explored.

Methods: In this cross-sectional study (May 13–20, 2025), 396 questions (198 English–Chinese pairs) were extracted from the Chinese national medical licensing examination. ChatGPT-4o, ChatGPT-o3, Gemini-2.5-pro, Deepseek-V3, Deepseek-R1, and Doubao-1.5-pro were prompted to provide answers. Responses were compared against reference answers, and accuracy was computed for three question types: basic knowledge (Type A), case analysis (Type B), and integrative judgment (Type C).

Results: Across all question types and languages, Doubao-1.5-pro achieved the highest accuracy at 92.0% ± 1.3%, whereas ChatGPT-4o had the lowest at 82.8% ± 3.7%. There was a significant main effect of question type (P = 0.0038) but no main effect of language (P = 0.56). Post hoc tests confirmed that Type A performance exceeded that of Types B and C (P < 0.01), while Types B and C did not differ significantly. Among the models, Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated notable cross-lingual stability, with accuracy differences between the Chinese and English versions remaining below 5%.

Conclusion: Question type was a key factor affecting LLM performance on Chinese medical licensing examination questions, whereas language had no significant impact. Doubao-1.5-pro, Deepseek-R1, and Deepseek-V3 demonstrated particularly strong cross-lingual consistency. These findings point to the potential value of specialized LLMs for enhancing medical education in China.
Tang et al. (Wed,) studied this question.
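
The scoring described in the Methods (comparing each response with its reference answer, then computing accuracy per question type and the Chinese-English accuracy gap per model) can be illustrated with a minimal sketch. It is not the authors' analysis code: the record layout, the field names (`model`, `language`, `qtype`, `answer`, `reference`), the language codes `zh`/`en`, the helper names, and the toy records are all assumptions made for illustration, and the case-insensitive exact match is only one plausible way to score a response.

```python
from collections import defaultdict


def accuracy_by(records, *keys):
    """Return {group: fraction correct}, grouping records by the given keys."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for rec in records:
        group = tuple(rec[k] for k in keys)
        totals[group] += 1
        # Assumed scoring rule: case-insensitive exact match with the reference.
        correct = rec["answer"].strip().upper() == rec["reference"].strip().upper()
        hits[group] += int(correct)
    return {g: hits[g] / totals[g] for g in totals}


def cross_lingual_gap(records, model):
    """Absolute accuracy difference between Chinese and English for one model."""
    acc = accuracy_by([r for r in records if r["model"] == model], "language")
    return abs(acc[("zh",)] - acc[("en",)])


# Hypothetical toy records, NOT data from the study.
records = [
    {"model": "model-x", "language": "zh", "qtype": "A", "answer": "B", "reference": "B"},
    {"model": "model-x", "language": "en", "qtype": "A", "answer": "B", "reference": "B"},
    {"model": "model-x", "language": "zh", "qtype": "B", "answer": "C", "reference": "D"},
    {"model": "model-x", "language": "en", "qtype": "B", "answer": "D", "reference": "D"},
]

by_type = accuracy_by(records, "qtype")      # accuracy per question type
gap = cross_lingual_gap(records, "model-x")  # Chinese vs. English gap
print(by_type, gap)
```

Grouping by an arbitrary tuple of keys keeps the same helper usable for per-model, per-language, or per-type breakdowns; the significance tests reported in the Results are outside the scope of this sketch.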