February 26, 2026
Open Access

Assessment of Correctness, Content Omission, and Risk of Harm in Large Language Model Responses to Ophthalmology Continuing Medical Education Questions

Abstract

Purpose: To evaluate the multiple-choice accuracy and the prose responses of 2 large language models (LLMs) on ophthalmology continuing medical education questions.

Design: Question prompts and multiple-choice (MC) answer options were input into the 2 LLMs. Responses were analyzed for accuracy and assessed for evidence of correctness, completeness, bias, and potential harm using a previously reported standardized rubric.

Subjects: Basic and Clinical Science Course questions and MC answer options from the American Academy of Ophthalmology question bank were used as inputs into the 2 LLMs (ChatGPT-4 and Google Vertex's Gemini Pro 1.5).

Methods: The MC responses were assessed for accuracy against the question bank's designated correct answer. The free-text prose responses from the 2 LLMs were assessed by 3 board-certified ophthalmologists.

Main Outcome Measures: Accuracy and assessment of correct and incorrect reasoning, inappropriate content, missing content, possibility of bias, and possibility of harm.

Results: < 0.05), respectively. Although evidence of correct reasoning was common in the prose responses (92% and 88% for ChatGPT-4 and Gemini Pro 1.5, respectively), there was also evidence of incorrect reasoning (42% and 58%), inappropriate content (29% and 36%), missing content (42% and 30%), and possibility of physical or emotional harm (36% and 44%).

Conclusions: Although ChatGPT-4 performed well in MC accuracy, the prose responses from both LLMs contained inaccuracies, missing content, and material that could lead to harm. Our findings suggest that provider-guided auditing in ophthalmology is required before the technology is used in direct patient-facing settings.

Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
