What does this research mean for the field?

Among large language models answering orthodontic questions from patients with periodontitis, Grok-3 provides the highest reliability and quality, and DeepSeek-V3 offers the best readability, though all evaluated models produce responses that exceed recommended readability thresholds for patient education. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

The study assesses the reliability, quality, and readability of chatbot responses for orthodontic patients with periodontitis.

May 3, 2026

Evaluating large language models for orthodontic consultation in patients with periodontitis: a study of reliability, quality, and readability.

Key Points

The study assesses the reliability, quality, and readability of chatbot responses for orthodontic patients with periodontitis.
The study evaluated five large language models (ChatGPT-4o, DeepSeek-V3, Claude-Sonnet-4, Gemini-2.0 Flash, Grok-3).
Reliability was rated with the modified DISCERN (mDISCERN) tool, quality with the Global Quality Score (GQS), and readability with the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores.
Differences among models were analysed with linear mixed-effects models, with significance set at P < 0.05.
Grok-3 scored highest in reliability (mDISCERN: 4.20 ± 0.48) and quality (GQS: 4.38 ± 0.61).
Claude-Sonnet-4 had the lowest reliability (mDISCERN: 3.54 ± 0.50) and quality (GQS: 3.63 ± 0.59).
DeepSeek-V3 was the most readable (FRE: 33.61 ± 6.11; FKGL: 10.10 ± 1.14), while Claude-Sonnet-4 was the least readable (FRE: 4.73 ± 4.14; FKGL: 13.72 ± 1.22).
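
For readers unfamiliar with the two readability scores reported above, the standard Flesch formulas can be sketched in a few lines of Python. This is a minimal illustration, not the study's scoring pipeline: the paper does not say which tool computed FRE and FKGL, and the `rough_counts` helper below is a hypothetical, naive counter (vowel groups stand in for syllables) that will differ from dedicated readability software.

```python
import re


def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade needed."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def rough_counts(text):
    """Hypothetical helper (an assumption, not from the paper).

    Returns (words, sentences, syllables) using naive rules:
    runs of . ! ? end a sentence, and each run of vowels in a
    word counts as one syllable (minimum one per word).
    """
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    tokens = re.findall(r"[A-Za-z']+", text)
    syllables = sum(
        max(1, len(re.findall(r"[aeiouy]+", token.lower()))) for token in tokens
    )
    return len(tokens), sentences, syllables


if __name__ == "__main__":
    sample = "Orthodontic treatment is possible once periodontitis is controlled."
    counts = rough_counts(sample)
    print("FRE:", round(flesch_reading_ease(*counts), 2))
    print("FKGL:", round(flesch_kincaid_grade(*counts), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences and polysyllabic clinical terms push FRE down and FKGL up.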

Abstract

BACKGROUND: This study aimed to evaluate and compare the performance of five publicly accessible large language model (LLM)-based chatbots, ChatGPT-4o, DeepSeek-V3, Claude-Sonnet-4, Gemini-2.0 Flash, and Grok-3, in addressing inquiries from patients with periodontitis seeking orthodontic treatment. The primary objective was to assess the reliability, quality, and readability of the LLM-generated responses.

METHODS: Thirty frequently asked questions regarding orthodontic treatment for patients with periodontitis were sourced from social media platforms and health-related websites and compiled for this study. Each LLM response was evaluated for reliability using the modified DISCERN (mDISCERN) tool, quality using the Global Quality Score (GQS), and readability using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Differences among models were analysed using linear mixed-effects models, with model treated as a fixed effect and question as a random effect. Post-hoc pairwise comparisons of estimated marginal means were performed with Bonferroni adjustment. Significance was set at P < 0.05.

RESULTS: Among the evaluated LLMs, significant performance differences were observed across all metrics (P < 0.001). Grok-3 provided the highest reliability and quality (mDISCERN: 4.20 ± 0.48; GQS: 4.38 ± 0.61), whereas Claude-Sonnet-4 scored the lowest (mDISCERN: 3.54 ± 0.50; GQS: 3.63 ± 0.59). DeepSeek-V3 was rated as most readable (FRE: 33.61 ± 6.11; FKGL: 10.10 ± 1.14), whereas Claude-Sonnet-4 was the least readable (FRE: 4.73 ± 4.14; FKGL: 13.72 ± 1.22). All models produced responses with university-level readability.

CONCLUSIONS: Grok-3 demonstrates higher reliability and quality, whereas DeepSeek-V3 generates more readable responses. All models exceed recommended readability thresholds for patient education. However, given the risks of misinformation and readability limitations, these chatbots should be considered supplementary educational resources rather than primary sources of medical information.
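
The Bonferroni step mentioned in the methods can be illustrated with a short sketch. With five models there are 10 possible pairwise comparisons, and a Bonferroni adjustment multiplies each raw p-value by the number of comparisons, capped at 1. The assumption that all 10 pairs were tested for each metric, and the function name `bonferroni_adjust`, are illustrative and not taken from the paper; the study's own comparisons were made on estimated marginal means from the mixed-effects models, which this sketch does not reproduce.

```python
from itertools import combinations

# Model names as listed in the abstract.
MODELS = ["ChatGPT-4o", "DeepSeek-V3", "Claude-Sonnet-4", "Gemini-2.0 Flash", "Grok-3"]

# Every unordered pair of models (assumed comparison family).
PAIRS = list(combinations(MODELS, 2))


def bonferroni_adjust(p_values):
    """Multiply each raw p-value by the family size, capping at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    print("Pairwise comparisons:", len(PAIRS))
    # Equivalent per-comparison threshold for a family-wise alpha of 0.05.
    print("Raw p-value threshold:", 0.05 / len(PAIRS))
```

Under this assumption, a raw pairwise p-value would need to fall below 0.005 to remain significant at the stated P < 0.05 level.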
