Abstract

Background: The rapid integration of large language models (LLMs) into online medical consultations demands rigorous evaluation, particularly in specialized fields such as paediatric ophthalmology. This study systematically assessed the diagnostic accuracy, safety and communication quality of advanced LLMs in real‐world paediatric ophthalmology tele‐consultations, comparing text‐only inputs with multi‐modal inputs (text paired with parent‐taken mobile photographs).

Methods: A prospective, blinded, multi‐model study was conducted using 50 authentic paediatric ophthalmology cases from an internet‐based hospital platform. Four leading LLMs were tested in multi‐modal and text‐only scenarios: three advanced models (Grok 4.1, GPT 5.1 Thinking and Gemini 2.5 Pro) and one baseline model (GPT‐4o). Responses were evaluated by three blinded senior paediatric ophthalmologists using a composite scoring system (maximum 19 points) covering diagnostic accuracy (up to six points), guideline adherence and safety (up to four points) and parent communication quality (up to nine points).

Results: Advanced LLMs consistently outperformed GPT‐4o, driven by higher Safety and Communication scores. In the multi‐modal arm, Grok 4.1 achieved the highest overall score (15.06 ± 0.68), followed by GPT 5.1 Thinking (14.52 ± 1.19) and Gemini 2.5 Pro (14.30 ± 1.08), while GPT‐4o lagged significantly at 11.20 ± 0.72. In the text‐only arm, the advanced models kept the same ranking: Grok 4.1 (14.08 ± 1.18), GPT 5.1 Thinking (13.62 ± 1.77) and Gemini 2.5 Pro (13.42 ± 1.50). Multi‐modal inputs significantly increased scores (e.g., Grok 4.1: Δ = 0.98, p < 0.001), with the gains attributable to improvements in Safety and Communication rather than in diagnostic accuracy. Regarding safety, advanced models produced markedly fewer major harmful responses (Grok 4.1: 4%; GPT 5.1 Thinking: 8%; Gemini 2.5 Pro: 10%) than GPT‐4o (16%). Parent preferences favoured advanced LLMs, with Grok 4.1 receiving 23% of votes, reflecting higher perceived clarity, empathy and trustworthiness.

Conclusion: Advanced LLMs demonstrate markedly superior capabilities over GPT‐4o in handling paediatric ophthalmology tele‐consultations, especially with multi‐modal data, offering enhanced safety, communication and parent trust. However, persistent variation in safety across models and the residual risk of harmful advice underscore the need for condition‐specific validation and stringent safety guardrails prior to deployment. Multi‐modal integration is essential for optimizing LLM reliability in this high‐stakes domain.
Kang et al. (Tue,) studied this question.
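
The arithmetic behind the composite score and the reported multi‐modal gain can be illustrated with a minimal sketch. It assumes only what the abstract states: three domains capped at 6, 4 and 9 points, and the mean overall scores for the three advanced models. The names `RubricScore`, `MAX_POINTS`, `REPORTED_MEANS` and `modality_gain` are hypothetical and introduced here for illustration; the study's actual scoring procedure and statistical tests are not described in the abstract and are not reproduced.

```python
from dataclasses import dataclass

# Domain caps as stated in the abstract: 6 + 4 + 9 = 19 points.
MAX_POINTS = {"diagnostic": 6, "safety": 4, "communication": 9}


@dataclass
class RubricScore:
    """One rater's score for one response (hypothetical container)."""
    diagnostic: float
    safety: float
    communication: float

    def total(self) -> float:
        # Reject any domain score outside its stated range before summing.
        for name, cap in MAX_POINTS.items():
            value = getattr(self, name)
            if not 0 <= value <= cap:
                raise ValueError(f"{name} must be in [0, {cap}], got {value}")
        return self.diagnostic + self.safety + self.communication


# Mean overall scores copied from the abstract (standard deviations omitted).
REPORTED_MEANS = {
    "Grok 4.1": {"multi_modal": 15.06, "text_only": 14.08},
    "GPT 5.1 Thinking": {"multi_modal": 14.52, "text_only": 13.62},
    "Gemini 2.5 Pro": {"multi_modal": 14.30, "text_only": 13.42},
}


def modality_gain(model: str) -> float:
    """Difference between multi-modal and text-only mean scores."""
    means = REPORTED_MEANS[model]
    return round(means["multi_modal"] - means["text_only"], 2)


if __name__ == "__main__":
    for model in REPORTED_MEANS:
        print(f"{model}: gain = {modality_gain(model):.2f}")
```

The abstract reports a gain only for Grok 4.1 (Δ = 0.98), which this calculation reproduces. The values for the other two models are plain differences of the reported means and carry no claim about statistical significance.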