< 0.05) but not for Safety. Compared with ChatGPT, lower odds of higher ratings were seen for Grok (OR 0.48) and DeepSeek (OR 0.61). Inter-rater reliability indicated moderate agreement (Fleiss' κ = 0.59) and strong consensus (Gwet's AC1 = 0.87). Conclusion: ChatGPT showed superior accuracy and clarity, while Gemini and Llama excelled in educational value and safety. High expert agreement supports AI chatbots as adjuncts in pediatric ophthalmology education, though continued validation is required.
Shweta et al. studied this question.