Previous research has highlighted the importance of standardized residency training (SRT) in cultivating competent medical specialists, with qualification examinations serving as a decisive step. Recent advances in large language models (LLMs) have drawn growing interest in their potential role in medical assessment. The present research investigates the performance of GPT-4O (gpt-4o-2024-05-13) and Claude 3 Opus (claude-3-opus-20240229) in the context of China’s SRT assessments, focusing on two distinct roles: as AI examinees and as automated exam item generators. We conducted a comparative evaluation using real-world orthopedic and general surgery SRT exam questions in both Chinese and English. In addition, both models were tasked with generating exam questions, which were reviewed by independent medical experts for content validity, curriculum alignment, and psychometric properties. Statistical analyses covered answer accuracy, item qualification rates, content coverage, internal consistency, and criterion validity. GPT-4O achieved over 79% answer accuracy across languages and specialties, consistently outperforming Claude 3 Opus. Items generated by GPT-4O exhibited higher qualification rates (89.3% vs. 62.9%), superior curriculum alignment (91.7% vs. 62.2%), and stronger psychometric quality. Moreover, a strong positive correlation (r = 0.707) between scores on GPT-4O-generated exams and historical student performance supported their practical relevance. The present study indicates that LLMs can serve dual roles in medical education, functioning both as reliable test-takers and as effective question generators. However, their application requires expert oversight and adherence to ethical standards to ensure validity in high-stakes assessments.

• Both GPT-4O and Claude 3 Opus achieved satisfactory accuracy on real SRT exam questions across languages, with GPT-4O exceeding 79% accuracy.
• Items generated by GPT-4O had higher qualification and curriculum alignment rates than those generated by Claude 3 Opus (89.3% vs. 62.9% and 91.7% vs. 62.2%, respectively), with both reaching a satisfactory level.
• GPT-4O outperformed Claude 3 Opus in psychometric quality and consistency.
• Expert-reviewed AI-authored items showed strong content and construct validity.
• The study shows that LLMs can assist in scalable, high-quality medical exam development.
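
The statistical measures named in the abstract (answer accuracy, internal consistency, criterion validity) can be illustrated with a minimal sketch. The abstract does not state which internal-consistency statistic was used or how r was computed, so Cronbach's alpha and Pearson's r are assumptions here, and the function names and toy data are hypothetical rather than the study's own.

```python
from math import sqrt
from statistics import mean, variance


def accuracy(predicted, key):
    """Fraction of items where the model's answer matches the answer key."""
    if len(predicted) != len(key):
        raise ValueError("predicted and key must have the same length")
    return sum(p == k for p, k in zip(predicted, key)) / len(key)


def cronbach_alpha(scores):
    """Cronbach's alpha (assumed internal-consistency measure).

    scores: one row per examinee, one column per item (e.g. 0/1 marks).
    """
    k = len(scores[0])
    item_vars = [variance(col) for col in zip(*scores)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson_r(x, y):
    """Pearson correlation (assumed criterion-validity measure)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the study.
    model_answers = ["A", "C", "B", "D", "A"]
    answer_key = ["A", "C", "B", "A", "A"]
    print("accuracy:", accuracy(model_answers, answer_key))

    item_marks = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
    print("alpha:", cronbach_alpha(item_marks))

    ai_exam_scores = [62, 75, 81, 90]
    historical_scores = [60, 70, 85, 88]
    print("r:", pearson_r(ai_exam_scores, historical_scores))
```

Sample variance is used for both the item and total-score terms in alpha; the ratio is unchanged if population variance is used for both instead.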
Zhong et al. (Sun,) studied this question.