Purpose: To evaluate the accuracy and prose responses of 2 large language models (LLMs) to ophthalmology continuing medical education questions.
Design: Question prompts and multiple choice (MC) answer options were input into the 2 LLMs. Responses were analyzed for accuracy and assessed for evidence of correctness, completeness, bias, and potential harm using a previously reported standardized rubric.
Subjects: Basic and Clinical Science Course questions and MC answer options from the American Academy of Ophthalmology question bank were used as inputs into the 2 LLMs (ChatGPT-4 and Google Vertex's Gemini Pro 1.5).
Methods: The MC responses were assessed for accuracy against the question bank's designated correct answer. The free-text prose responses from the 2 LLMs were assessed by 3 board-certified ophthalmologists.
Main Outcome Measures: Accuracy and assessment of correct and incorrect reasoning, inappropriate content, missing content, possibility of bias, or possibility of harm.
Results: < 0.05), respectively. Though there was high evidence of correct reasoning in the prose responses (92% and 88% for ChatGPT-4 and Gemini Pro 1.5, respectively), there was also evidence of incorrect reasoning (42% and 58%), inappropriate content (29% and 36%), missing content (42% and 30%), and possibility of physical or emotional harm (36% and 44%).
Conclusions: Though ChatGPT-4 performed well in MC accuracy, the prose responses of both LLMs contained inaccuracies, missing content, and material that could lead to harm. Our findings suggest that provider-guided auditing in ophthalmology is required before the technology is used in direct patient-facing settings.
Financial Disclosures: Proprietary or commercial disclosure may be found in the Footnotes and Disclosures at the end of this article.
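The two kinds of outcome measure described in the abstract (MC accuracy against an answer key, and the share of prose responses flagged on a rubric item) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the function names `mc_accuracy` and `rubric_rate` are hypothetical, the toy inputs are made up and are not study data, and the abstract does not state how the 3 raters' judgments were combined, so the sketch takes one already-aggregated yes/no flag per response.

```python
from typing import Sequence


def mc_accuracy(model_answers: Sequence[str], answer_key: Sequence[str]) -> float:
    """Fraction of questions where the model's choice matches the answer key."""
    if len(model_answers) != len(answer_key):
        raise ValueError("answer lists must have the same length")
    if not answer_key:
        raise ValueError("no questions to score")
    # Compare option letters case-insensitively, ignoring stray whitespace.
    correct = sum(
        m.strip().upper() == k.strip().upper()
        for m, k in zip(model_answers, answer_key)
    )
    return correct / len(answer_key)


def rubric_rate(flags: Sequence[bool]) -> float:
    """Share of prose responses flagged for a single rubric item."""
    if not flags:
        raise ValueError("no responses to score")
    return sum(flags) / len(flags)


# Made-up toy inputs (hypothetical, not taken from the study).
toy_key = ["A", "C", "B", "D"]
toy_model = ["A", "c", "D", "D"]
toy_flags = [True, False, False, True]

print(f"MC accuracy: {mc_accuracy(toy_model, toy_key):.0%}")
print(f"Rubric item rate: {rubric_rate(toy_flags):.0%}")
```

Because a response can be flagged on several rubric items at once, the per-item rates are computed independently and need not sum to 100%, which is consistent with the abstract reporting both correct and incorrect reasoning for the same set of responses.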
Chen et al. studied this question.