What question did this study set out to answer?

To assess the accuracy, readability, and comprehensiveness of responses generated by AI chatbots for pediatric contact lens questions.

March 18, 2026 | Open Access

Evaluating AI Chatbots for Pediatric Contact Lenses: A Study on Accuracy, Readability, and Reliability

Key Points

The study assessed the accuracy, readability, and comprehensiveness of AI chatbot responses to pediatric contact lens questions.
Evaluated five AI chatbot platforms: ChatGPT-4o, Gemini 1.5, Perplexity, Copilot, and Claude 3.5 Sonnet.
Used 28 curated questions related to pediatric contact lenses for evaluation.
Two pediatric ophthalmologists graded anonymized responses using DISCERN, PEMAT-P, and 5-point Likert scales for accuracy and comprehensiveness; readability was measured with automated indices.
Expert-written responses were included only as a readability benchmark.
Accuracy and comprehensiveness differed significantly across platforms (p=0.0216 and p=0.0067, respectively).
ChatGPT-4o produced the longest responses (p<0.0001) and scored higher than Perplexity on accuracy and comprehensiveness in post-hoc comparisons (p=0.0173 and p=0.0087).
Reproducibility was high for general pediatric queries but lower for specialized aphakic queries (p=0.041).
Factual inaccuracies were more prevalent in responses regarding aphakic contact lenses.

Abstract

This study evaluated the accuracy, readability, and comprehensiveness of patient-facing responses generated by LLM-based chatbot platforms to pediatric contact lens (CL)–related questions, using expert grading and readability benchmarking. Five platforms (ChatGPT-4o, Gemini 1.5, Perplexity, Copilot, and Claude 3.5 Sonnet) were assessed using 28 curated questions. Two pediatric ophthalmologists graded anonymized outputs using DISCERN and PEMAT-P, 5-point Likert scales for accuracy and comprehensiveness, and multiple automated readability indices. Expert-written responses were included only for readability benchmarking. ChatGPT-4o produced the longest responses (p<0.0001). Accuracy and comprehensiveness differed across platforms (p=0.0216 and p=0.0067), with ChatGPT-4o scoring higher than Perplexity in post-hoc comparisons (p=0.0173 and p=0.0087). Expert responses were shorter but showed higher complexity on readability indices. Accuracy-based reproducibility was high for general pediatric CL queries but lower for aphakic CL–related questions (p=0.041), and factual inaccuracies were more frequent in aphakic topics. While LLMs may support patient education, variability in correctness and completeness underscores the need for expert oversight; these tools should complement, not replace, clinical expertise in pediatric CL usage.
