Large Language Models (LLMs) demonstrate increasing capabilities in medical knowledge assessment, yet limitations remain in cross-population validation, direct human-AI comparisons, and evaluation of newer models in anesthesiology contexts. This study addresses these gaps with a head-to-head comparison between newer LLMs and human examinees on official Israeli multiple-choice board examinations. We evaluated two LLMs (Claude 3.7 Sonnet and ChatGPT-4) against anonymized aggregate data from 381 examinees on three consecutive official Israeli anesthesiology board examinations (2023–2024), comprising 450 multiple-choice questions stratified by difficulty, discrimination ability, and topic. Each model was tested twice per exam. Claude 3.7 Sonnet achieved 73.67% accuracy, significantly outperforming both human examinees (62.77%, P < 0.001) and ChatGPT-4 (64.44%, P < 0.001). However, both LLMs performed below the upper quartile of human performance (78.05%). Although LLMs excelled on easy questions and in theoretical domains such as cardiac physiology (Claude: 96.88%, ChatGPT-4: 81.25%), they performed worse in areas such as ambulatory anesthesia (Claude: 30.00%, ChatGPT-4: 10.00%) and regional anesthesia (Claude: 44.44%, ChatGPT-4: 38.89%). Human examinees demonstrated consistent performance across all domains, whereas LLMs showed extreme variability. Self-consistency was substantial for both LLMs (κ = 0.66–0.68), but agreement with human responses was moderate (κ = 0.34–0.39). Advanced LLMs currently exceed average examinee performance on anesthesiology board examinations, but they fall short of top-quartile examinees and show significant performance variability across topic areas.
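The agreement figures above are reported as κ values. The abstract does not say which estimator or software was used, so the following is only a minimal sketch assuming the standard unweighted Cohen's kappa for two sets of answers to the same questions. The function name `cohen_kappa` and the answer lists are hypothetical toy data, not the study's responses.

```python
from collections import Counter


def cohen_kappa(answers_a, answers_b):
    """Unweighted Cohen's kappa for two equal-length answer sequences.

    Assumption: this is the textbook estimator; the study's exact
    method is not stated in the abstract.
    """
    if len(answers_a) != len(answers_b) or not answers_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(answers_a)

    # Observed agreement: fraction of questions with identical answers.
    observed = sum(a == b for a, b in zip(answers_a, answers_b)) / n

    # Chance agreement: for each option, multiply the two marginal
    # frequencies, then sum over all options.
    counts_a, counts_b = Counter(answers_a), Counter(answers_b)
    options = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[o] * counts_b[o] for o in options) / (n * n)

    if expected == 1.0:
        # Both sequences use a single identical option; kappa is
        # undefined, and we return 1.0 by convention in this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy data (hypothetical): two runs of one model on eight questions.
run_1 = ["A", "B", "C", "D", "A", "B", "C", "D"]
run_2 = ["A", "B", "C", "D", "A", "B", "D", "C"]
kappa = cohen_kappa(run_1, run_2)
print(f"kappa = {kappa:.2f}")
```

In this toy case the two runs agree on 6 of 8 questions (observed agreement 0.75), while chance agreement is 0.25, giving κ = (0.75 − 0.25) / (1 − 0.25) ≈ 0.67. The example shows why κ sits below raw percent agreement: it discounts matches expected by chance alone.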
Ronen et al. studied this question.