What question did this study set out to answer?

The research aims to evaluate and compare the performance of large language models and human examinees on anesthesiology board examinations.

March 28, 2026 | Open Access

Large language models versus human examinee performance on Israeli anesthesiology board examinations

Key Points

Conducted a head-to-head comparison of Claude 3.7 Sonnet and ChatGPT-4 against anonymized aggregate data from 381 human examinees.
Evaluated performance on three consecutive official Israeli anesthesiology board examinations (2023–2024), comprising 450 multiple-choice questions stratified by difficulty, discrimination ability, and topic.
Each model was tested twice per exam to assess accuracy and variability.
Claude 3.7 Sonnet achieved 73.67% accuracy, significantly outperforming both human examinees (62.77%, P < 0.001) and ChatGPT-4 (64.44%, P < 0.001).
Both LLMs scored lower than the upper quartile of human performance (78.05%).
LLMs excelled on easy questions and theoretical domains such as cardiac physiology but performed poorly in ambulatory and regional anesthesia.

Abstract

Large Language Models (LLMs) demonstrate increasing capabilities in medical knowledge assessment, yet limitations remain in cross-population validation, direct human-AI comparisons, and evaluation of newer models in anesthesiology contexts. This study addresses these gaps by conducting a head-to-head comparison between newer LLMs and human examinees on official Israeli multiple-choice board examinations. We evaluated two LLMs (Claude 3.7 Sonnet and ChatGPT-4) against anonymized aggregate data from 381 examinees on three consecutive official Israeli anesthesiology board examinations (2023–2024), comprising 450 multiple-choice questions stratified by difficulty, discrimination ability, and topic. Each model was tested twice per exam. Claude 3.7 Sonnet achieved 73.67% accuracy, significantly outperforming both human examinees (62.77%, P < 0.001) and ChatGPT-4 (64.44%, P < 0.001). However, both LLMs performed below the upper quartile of human performance (78.05%). While LLMs excelled on easy questions and theoretical domains like cardiac physiology (Claude: 96.88%, ChatGPT-4: 81.25%), they showed lower performance in areas such as ambulatory (Claude: 30.00%, ChatGPT-4: 10.00%) and regional anesthesia (Claude: 44.44%, ChatGPT-4: 38.89%). Human examinees demonstrated consistent performance across all domains, whereas LLMs showed extreme variability. Self-consistency was substantial for both LLMs (κ = 0.66–0.68), but agreement with human responses was moderate (κ = 0.34–0.39). While advanced LLMs currently exceed average examinee performance on anesthesiology board examinations, they fall short of top-quartile examinees at present and demonstrate significant performance variability across different topic areas.
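The accuracy and agreement figures above can be illustrated with a minimal sketch. It assumes that self-consistency is measured with Cohen's kappa over the answer options chosen in two runs of the same model; the abstract does not state the exact procedure, so this is an assumption. The function names, the answer key, and the ten-question answer lists are hypothetical toy values, not data from the study.

```python
from collections import Counter


def accuracy(answers, key):
    """Percentage of answers that match the answer key."""
    return 100.0 * sum(a == k for a, k in zip(answers, key)) / len(key)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa, (p_o - p_e) / (1 - p_e), for two equal-length label lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Observed agreement: share of items where both runs gave the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two runs' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if p_e == 1.0:
        # Both runs used one identical label throughout; agreement is perfect.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy data: an answer key and two runs of one model.
key = list("ABCDABCCAB")
run_1 = list("ABCDABCDAB")
run_2 = list("ABCDABCDBA")

print(f"run 1 accuracy: {accuracy(run_1, key):.2f}%")
print(f"run 2 accuracy: {accuracy(run_2, key):.2f}%")
print(f"self-consistency kappa: {cohen_kappa(run_1, run_2):.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which is why two runs that match on most questions can still yield a value well below 1. The same function could be applied to a model's answers and the most common human answer per question, although how the study derived its human-agreement kappa is not described in the abstract.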
