Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations

Abstract

Background: Interest surrounding generative large language models (LLMs) has grown rapidly. While ChatGPT (GPT-3.5), a general LLM, has shown near-passing performance on medical student board examinations, the performance of ChatGPT or its successor GPT-4 on specialized exams, and the factors affecting their accuracy, remain unclear.

Objective: To assess the performance of ChatGPT and GPT-4 on a 500-question mock neurosurgical written boards examination.

Methods: The Self-Assessment Neurosurgery Exams (SANS) American Board of Neurological Surgery (ABNS) Self-Assessment Exam 1 was used to evaluate ChatGPT and GPT-4. Questions were in single best answer, multiple-choice format. Chi-squared tests, Fisher’s exact tests, and univariable logistic regression were used to assess performance differences in relation to question characteristics.

Results: ChatGPT (GPT-3.5) and GPT-4 achieved scores of 73.4% (95% confidence interval [CI]: 69.3-77.2%) and 83.4% (95% CI: 79.8-86.5%), respectively, relative to the user average of 73.7% (95% CI: 69.6-77.5%). Question bank users and both LLMs exceeded last year’s passing threshold of 69%. While scores between ChatGPT and question bank users were equivalent (P = 0.963), GPT-4 outperformed both (both P < 0.005). Multimodal input was not available at the time of this study, so on questions with image content, ChatGPT and GPT-4 answered 49.5% and 56.8% of questions correctly, respectively, based on contextual clues alone.

Conclusion: LLMs achieved passing scores on a mock 500-question neurosurgical written board examination, with GPT-4 significantly outperforming ChatGPT.
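As a rough illustration of the statistics reported above, the following Python sketch recomputes 95% confidence intervals and a chi-squared comparison from the headline scores. Several things here are assumptions rather than details from the paper: the correct-answer counts (367 and 417) are back-calculated from 73.4% and 83.4% of 500 questions, the Wilson score interval is used because the abstract does not say which interval method the authors chose, and the chi-squared test is an unpaired Pearson test without continuity correction. The function names are hypothetical. Results may therefore differ slightly from the published bounds and P values.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95% quantile of the standard normal


def wilson_interval(correct, total, z=Z_95):
    """Wilson score interval for a binomial proportion (assumed method)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def chi2_two_proportions(correct_a, correct_b, total):
    """Pearson chi-squared test (1 df, no continuity correction) comparing
    two models that each answered the same number of questions."""
    observed = [
        [correct_a, total - correct_a],
        [correct_b, total - correct_b],
    ]
    col_totals = [correct_a + correct_b, 2 * total - correct_a - correct_b]
    stat = 0.0
    for row in observed:
        for obs, col_total in zip(row, col_totals):
            expected = col_total / 2  # rows are equal-sized, so each gets half
            stat += (obs - expected) ** 2 / expected
    # For 1 degree of freedom, the chi-squared survival function is erfc(sqrt(x/2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


N_QUESTIONS = 500
CHATGPT_CORRECT = 367  # assumption: 73.4% of 500
GPT4_CORRECT = 417     # assumption: 83.4% of 500

for name, correct in [("ChatGPT", CHATGPT_CORRECT), ("GPT-4", GPT4_CORRECT)]:
    low, high = wilson_interval(correct, N_QUESTIONS)
    print(f"{name}: {correct / N_QUESTIONS:.1%} (95% CI: {low:.1%}-{high:.1%})")

stat, p_value = chi2_two_proportions(CHATGPT_CORRECT, GPT4_CORRECT, N_QUESTIONS)
print(f"ChatGPT vs GPT-4: chi2 = {stat:.2f}, P = {p_value:.5f}")
```

Because both models answered the same 500 questions, a paired test such as McNemar's would also be a reasonable choice; it would need the per-question results, which the abstract does not provide.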

Authors

Rohaid Ali

Harvard University

Oliver Y. Tang

University of Pittsburgh

Ian D. Connolly

Mass General Brigham

Institutions

Massachusetts General Hospital

University of Pittsburgh

Brown University

Cite this study

Ali et al. studied this question.

synapsesocial.com/papers/6a1c3411ea84844e355f9516 — DOI: https://doi.org/10.1101/2023.03.25.23287743

Also consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

A deep learning system for differential diagnosis of skin diseases · 2020 · 739 citations
Study Behaviors and USMLE Step 1 Performance: Implications of a Student Self-Directed Parallel Curriculum · 2017 · 168 citations
On Chatbots and Generative Artificial Intelligence · 2023 · 20 citations
How to develop machine learning models for healthcare · 2019 · 279 citations
Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models · 2023 · 3,524 citations
