Abstract
Background: Interest in generative large language models (LLMs) has grown rapidly. While ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized exams and the factors affecting accuracy remain unclear.
Objective: To assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written boards examination.
Methods: The Self-Assessment Neurosurgery Exams (SANS) American Board of Neurological Surgery (ABNS) Self-Assessment Exam 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. Chi-squared, Fisher's exact, and univariable logistic regression tests were employed to assess performance differences in relation to question characteristics.
Results: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% confidence interval [CI]: 69.3-77.2%) and 83.4% (95% CI: 79.8-86.5%), respectively, relative to the user average of 73.7% (95% CI: 69.6-77.5%). Question bank users and both LLMs exceeded last year's passing threshold of 69%. While scores between ChatGPT and question bank users were equivalent (P=0.963), GPT-4 outperformed both (both P<0.005). Multimodal input was not available at the time of this study, so on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly based on contextual clues alone.
Conclusion: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
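The confidence intervals reported in the Results can be approximated directly from the scores. The following is a minimal sketch, not the authors' method: the abstract does not say which interval was used, so the Wilson score interval is assumed here, and the counts of correct answers (367 and 417 out of 500) are inferred from the reported percentages rather than stated in the source. The function name `wilson_interval` is illustrative.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level.
    Returns (lower, upper) as fractions between 0 and 1.
    """
    p_hat = correct / total
    denom = 1.0 + z ** 2 / total
    center = (p_hat + z ** 2 / (2 * total)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / total + z ** 2 / (4 * total ** 2))
    ) / denom
    return center - half_width, center + half_width


# Counts are assumptions derived from the abstract: 73.4% and 83.4% of 500.
for name, correct in [("ChatGPT (GPT-3.5)", 367), ("GPT-4", 417)]:
    lower, upper = wilson_interval(correct, 500)
    print(f"{name}: {correct / 500:.1%} (95% CI: {lower:.1%}-{upper:.1%})")
```

Under these assumptions the bounds land within a fraction of a percentage point of the reported intervals. An exact (Clopper-Pearson) interval is slightly wider than the Wilson interval, which may account for the small differences.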
Rohaid Ali
Harvard University
Oliver Y. Tang
University of Pittsburgh
Ian D. Connolly
Mass General Brigham
Massachusetts General Hospital
University of Pittsburgh
Brown University
Ali et al. studied this question.
synapsesocial.com/papers/6a1c3411ea84844e355f9516 — DOI: https://doi.org/10.1101/2023.03.25.23287743
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: