What question did this study set out to answer?

This research aims to assess and compare the performance of GPT-4O and Claude 3 Opus in medical examination contexts.

February 22, 2026 | Open Access

When AI Meets Medical Assessment: A Comparative Study of GPT-4O and Claude 3 Opus in China’s Standardized Resident Physician Examinations

Key Points

This research aims to assess and compare the performance of GPT-4O and Claude 3 Opus in medical examination contexts.
The study compared the two models on real-world SRT exam questions.
Both models were evaluated as AI examinees and as automated question generators.
Generated exam questions were assessed for content validity, curriculum alignment, and psychometric properties.
Statistical analyses included answer accuracy and internal consistency.
GPT-4O achieved over 79% answer accuracy, outperforming Claude 3 Opus.
GPT-4O-generated items had higher qualification rates (89.3% vs. 62.9%) and better curriculum alignment (91.7% vs. 62.2%) than items generated by Claude 3 Opus.
Scores on GPT-4O-generated exams correlated strongly and positively with historical student performance (r = 0.707).

Abstract

Previous research has highlighted the importance of standardized residency training (SRT) in cultivating competent medical specialists, with qualification examinations serving as a decisive step. Recent advances in large language models (LLMs) have drawn growing interest in their potential role in medical assessment. The present research investigates the performance of GPT-4O (gpt-4o-2024-05-13) and Claude 3 Opus (claude-3-opus-20240229) in the context of China’s SRT assessments, focusing on two distinct roles: as AI examinees and as automated exam item generators. We conducted a comparative evaluation using real-world orthopedic and general surgery SRT exam questions in both Chinese and English. In addition, both models were tasked with generating exam questions, which were reviewed by independent medical experts for content validity, curriculum alignment, and psychometric properties. Statistical analyses included answer accuracy, item qualification rates, content coverage, internal consistency, and criterion validity. Findings showed that GPT-4O achieved over 79% answer accuracy across languages and specialties, consistently outperforming Claude 3 Opus. Items generated by GPT-4O exhibited higher qualification rates (89.3% vs. 62.9%), superior curriculum alignment (91.7% vs. 62.2%), and stronger psychometric quality. Moreover, a strong positive correlation (r = 0.707) between GPT-4O-generated exam scores and historical student performance confirmed their practical relevance. The present study demonstrates that LLMs can effectively serve dual roles in medical education, functioning both as reliable test-takers and as effective question generators. However, their application requires expert oversight and adherence to ethical standards to ensure validity in high-stakes assessments.

• Both GPT-4O and Claude 3 Opus achieved satisfactory accuracy on real SRT exam questions across languages, with GPT-4O reaching >80% accuracy.
• GPT-4O-generated items had higher qualification and curriculum alignment rates than Claude 3 Opus-generated items, with both models reaching a satisfactory level.
• GPT-4O outperformed Claude 3 Opus in psychometric quality and consistency.
• Expert-reviewed AI-authored items showed strong content and construct validity.
• The study shows that LLMs can assist in scalable, high-quality medical exam development.
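
The statistics named in the abstract (answer accuracy, internal consistency, and the correlation used for criterion validity) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. It assumes that internal consistency is measured with Cronbach's alpha and that r is Pearson's correlation coefficient, neither of which this summary confirms, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def accuracy(answers, key):
    """Fraction of items where the model's answer matches the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists
    (assumed here to be the 'r' reported for criterion validity)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cronbach_alpha(scores):
    """Cronbach's alpha (assumed internal-consistency measure).

    scores: one row per examinee, one column per exam item.
    """
    k = len(scores[0])
    item_vars = [pvariance(col) for col in zip(*scores)]
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)
```

In this sketch, `accuracy` would be applied to one model's answers on a question set, `pearson_r` to examinee scores on a generated exam paired with their historical scores, and `cronbach_alpha` to an examinee-by-item score matrix for a generated exam.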

Cite This Study

Zhong et al. studied this question.

synapsesocial.com/papers/699a9ceb482488d673cd2b04
https://doi.org/10.1016/j.chbr.2026.100974
