September 10, 2025

A Comparative Study on Generative Artificial Intelligence by Evaluating Multiple Large Language Models for Guidance to Parents Toward Pediatric Dentistry: A Multimodal Comparative LLM Study

Key Points

ChatGPT-4o achieved the highest quality score of 4.40, indicating superior content quality for pediatric dental guidance.
Claude 3.7 Sonnet provided the most readable responses, with a Flesch Reading Ease score of 76.29, suitable for caregivers.
The cross-sectional evaluation covered nine LLMs; 10 pediatric dentists rated the quality of their responses to 20 standardized questions, and readability and originality were assessed separately.
Performance variability among the LLMs means they should be integrated into pediatric dental guidance with caution.

Abstract

Aim: To compare the clinical quality, readability, and originality of responses from nine generative large language models (LLMs) to pediatric dental queries commonly asked by caregivers.

Materials and Methods: A cross-sectional evaluation of nine LLMs (ChatGPT-3.5, ChatGPT-4o, Claude 3.5 Haiku, Claude 3.7 Sonnet, Gemini 2.0, Gemini 2.5, Grok-3, Grok-3 Mini, and DeepSeek-V3) was conducted using 20 standardized open-ended pediatric dental questions. Responses were rated by 10 pediatric dentists using the Modified Global Quality Scale (MGQS). Readability was assessed via Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL), and originality was analyzed using Turnitin®. One-way analysis of variance with post hoc tests and Cohen’s kappa were applied.

Results: ChatGPT-4o achieved the highest MGQS score (4.40 ± 0.30, P < 0.001), while DeepSeek-V3 scored the lowest (2.02 ± 0.25). Claude 3.7 Sonnet produced the most readable responses (FRE 76.29 ± 10.77), whereas Grok-3 Mini produced the most complex text (FKGL 14.10 ± 3.90). All LLMs demonstrated high originality (<17% similarity), with Claude 3.5 Haiku and Grok-3 Mini showing the lowest overlap (2%). Inter-rater agreement was substantial (κ = 0.72).

Conclusion: ChatGPT-4o demonstrated superior content quality, while Claude 3.7 Sonnet and Gemini 2.5 provided more user-friendly readability. Performance variability among LLMs warrants cautious integration into pediatric dental guidance.
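
For readers unfamiliar with the two readability metrics named in the abstract, the sketch below shows the standard published formulas for Flesch Reading Ease and Flesch–Kincaid Grade Level. It is illustrative only: the abstract does not say which tool the authors used to compute the scores, and the function names and the example counts are assumptions, not values from the study.

```python
# Illustrative sketch of the standard readability formulas.
# Function names and example counts are hypothetical, not from the study.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Higher scores mean easier text (roughly 0 to 100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Approximate US school grade level needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # A made-up passage: 100 words, 5 sentences, 150 syllables.
    print(round(flesch_reading_ease(100, 5, 150), 2))
    print(round(flesch_kincaid_grade(100, 5, 150), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why a model that writes shorter sentences with simpler words scores higher on FRE and lower on FKGL.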
