March 3, 2026 | Open Access

Comparative performance of GPT-4, GPT-o3, GPT-5, Gemini-3-Flash, and DeepSeek-R1 in ophthalmology question answering

Key points

GPT-o3 and Gemini-3-Flash performed best in ophthalmology question answering, making them suitable candidates for clinical decision support.
Notably, GPT-5 did not exceed its predecessor's accuracy or stability for medical questions.
Prompt engineering had only a limited effect on large language model performance for closed-ended questions.
Future research should focus on multimodal integration and validation in actual healthcare settings.

Abstract

GPT-o3 and Gemini-3-Flash achieve superior stability and accuracy in ophthalmology question answering (QA), making them suitable for high-stakes clinical decision support. The open-source model DeepSeek-R1 shows competitive potential, especially in complex tasks. Notably, GPT-5 failed to surpass its predecessor in either accuracy or consistency in this specialized domain. Prompt engineering has a limited impact on performance for closed-ended medical questions. Future work should extend to multimodal integration and real-world clinical validation to enhance the practical utility and reliability of large language models (LLMs) in medicine.
