What question did this study set out to answer?

This research evaluates the diagnostic accuracy and safety of advanced language models in paediatric tele-ophthalmology consultations.

April 23, 2026

Multi‐modal large language models for paediatric tele‐ophthalmology: A blinded real‐world evaluation of diagnostic accuracy and safety

Key Points

The study was prospective and blinded, using 50 authentic paediatric ophthalmology cases from an internet-based hospital platform.
It compared the performance of four large language models using both text-only and multi-modal inputs.
Responses were evaluated by three senior paediatric ophthalmologists through a composite scoring system.
Grok 4.1 outperformed other models in both multi-modal and text-only scenarios, achieving the highest scores.
Multi-modal inputs significantly increased scores, particularly enhancing safety and communication aspects.
Advanced models produced fewer major harmful responses than GPT-4o (16%), with Grok 4.1 the lowest at 4%.

Abstract

Background: The rapid integration of large language models (LLMs) into online medical consultations demands rigorous evaluation, particularly in specialized fields like paediatric ophthalmology. This study systematically assessed the diagnostic accuracy, safety and communication quality of advanced LLMs in real‐world paediatric ophthalmology tele‐consultations, with a focus on comparing text‐only inputs to multi‐modal inputs (text paired with parent‐taken mobile photographs).

Methods: A prospective, blinded, multi‐model study was conducted using 50 authentic paediatric ophthalmology cases from an internet‐based hospital platform. Four leading LLMs were tested in multi‐modal and text‐only scenarios: Grok 4.1, GPT 5.1 Thinking and Gemini 2.5 Pro (advanced models), and GPT‐4o (baseline). Responses were evaluated by three blinded senior paediatric ophthalmologists using a composite scoring system (maximum 19 points), encompassing diagnostic accuracy (up to six points), guideline adherence and safety (up to four points) and parent communication quality (up to nine points).

Results: Advanced LLMs consistently outperformed GPT‐4o, driven by superior Safety and Communication efficacy. Grok 4.1 led with the highest overall score in the multi‐modal arm (15.06 ± 0.68), followed by GPT 5.1 Thinking (14.52 ± 1.19) and Gemini 2.5 Pro (14.30 ± 1.08), while GPT‐4o lagged significantly at 11.20 ± 0.72. In the text‐only arm, advanced models again excelled: Grok 4.1 (14.08 ± 1.18), GPT 5.1 Thinking (13.62 ± 1.77) and Gemini 2.5 Pro (13.42 ± 1.50). Multi‐modal inputs significantly boosted scores (e.g., Grok 4.1: Δ = 0.98, p < 0.001), with gains specifically attributable to improvements in Safety and Communication rather than diagnostics. Regarding safety, advanced models exhibited markedly fewer major harmful responses (Grok 4.1: 4%; GPT 5.1 Thinking: 8%; Gemini 2.5 Pro: 10%) compared to GPT‐4o (16%). Parent preferences favoured advanced LLMs, with Grok 4.1 receiving 23% of votes, reflecting higher perceived clarity, empathy and trustworthiness.

Conclusion: Advanced LLMs demonstrate markedly superior capabilities over GPT‐4o in handling paediatric ophthalmology tele‐consultations, especially with multi‐modal data, offering enhanced safety, communication and parent trust. However, persistent variations in safety across models and residual risks of harmful advice underscore the need for condition‐specific validation and stringent safety guardrails prior to deployment. Multi‐modal integration is essential for optimizing LLM reliability in this high‐stakes domain.
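
The arithmetic behind the composite score and the multi‐modal gain can be checked with a short sketch. This is only an illustration built from the figures quoted in the abstract, not the authors' analysis code: the names `MAX_POINTS`, `REPORTED_MEANS`, `composite_max`, `multimodal_gain` and `percent_of_max` are hypothetical. The gain is a plain difference of reported means; it does not reproduce the paired statistical test behind the reported p‐value, and the abstract itself states the difference only for Grok 4.1.

```python
# Maximum points per scoring domain, as stated in the Methods.
MAX_POINTS = {
    "diagnostic_accuracy": 6,
    "guideline_adherence_and_safety": 4,
    "parent_communication": 9,
}

# Mean overall scores quoted in the Results: (multi-modal arm, text-only arm).
# GPT-4o is omitted because the abstract gives no text-only mean for it.
REPORTED_MEANS = {
    "Grok 4.1": (15.06, 14.08),
    "GPT 5.1 Thinking": (14.52, 13.62),
    "Gemini 2.5 Pro": (14.30, 13.42),
}


def composite_max(max_points):
    """Return the maximum composite score (sum of the domain maxima)."""
    return sum(max_points.values())


def multimodal_gain(means):
    """Return the multi-modal minus text-only mean for each model, to 2 decimals."""
    return {
        model: round(multi_modal - text_only, 2)
        for model, (multi_modal, text_only) in means.items()
    }


def percent_of_max(score, maximum):
    """Express a score as a percentage of the maximum, to 1 decimal."""
    return round(100 * score / maximum, 1)


if __name__ == "__main__":
    maximum = composite_max(MAX_POINTS)
    print("Maximum composite score:", maximum)
    for model, gain in multimodal_gain(REPORTED_MEANS).items():
        share = percent_of_max(REPORTED_MEANS[model][0], maximum)
        print(f"{model}: gain {gain}, multi-modal mean is {share}% of maximum")
```

For Grok 4.1 the computed difference matches the Δ = 0.98 quoted in the Results, and the three domain maxima sum to the stated 19 points.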


Cite This Study

Kang et al. studied this question.

synapsesocial.com/papers/69e9bb2285696592c86ed032 https://doi.org/10.1111/aos.70136
