BACKGROUND: This study aimed to evaluate and compare the performance of five publicly accessible large language model (LLM)-based chatbots (ChatGPT-4o, DeepSeek-V3, Claude-Sonnet-4, Gemini-2.0 Flash, and Grok-3) in answering questions from patients with periodontitis seeking orthodontic treatment. The primary objective was to assess the reliability, quality, and readability of the LLM-generated responses.
METHODS: Thirty frequently asked questions about orthodontic treatment for patients with periodontitis were compiled from social media platforms and health-related websites. Each LLM response was evaluated for reliability using the modified DISCERN (mDISCERN) tool, for quality using the Global Quality Score (GQS), and for readability using the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) scores. Differences among models were analysed using linear mixed-effects models, with model treated as a fixed effect and question as a random effect. Post-hoc pairwise comparisons of estimated marginal means were performed with Bonferroni adjustment. Significance was set at P < 0.05.
RESULTS: Significant differences among the evaluated LLMs were observed across all metrics (P < 0.001). Grok-3 provided the highest reliability and quality (mDISCERN: 4.20 ± 0.48; GQS: 4.38 ± 0.61), whereas Claude-Sonnet-4 scored the lowest (mDISCERN: 3.54 ± 0.50; GQS: 3.63 ± 0.59). DeepSeek-V3 was rated the most readable (FRE: 33.61 ± 6.11; FKGL: 10.10 ± 1.14), whereas Claude-Sonnet-4 was the least readable (FRE: 4.73 ± 4.14; FKGL: 13.72 ± 1.22). All models produced responses with university-level readability.
CONCLUSIONS: Grok-3 demonstrates higher reliability and quality, whereas DeepSeek-V3 generates more readable responses. All models produce text above the reading level recommended for patient education materials. However, given the risks of misinformation and the readability limitations, these chatbots should be considered supplementary educational resources rather than primary sources of medical information.
Li et al. (Thu,) studied this question.
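The two readability metrics named in the methods can be illustrated with a minimal sketch. The abstract does not say which tool computed the FRE and FKGL scores, so the code below is an assumption-laden illustration rather than the authors' procedure: it uses the standard published Flesch and Flesch-Kincaid formulas, while the helper names (`count_syllables`, `flesch_reading_ease`, `flesch_kincaid_grade`, `readability`) and the crude vowel-group syllable heuristic are hypothetical choices made here. Dedicated readability tools count syllables more carefully, so scores from this sketch will not exactly match theirs.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumed heuristic): count vowel groups,
    then discount a likely silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(n_sentences: int, n_words: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher means easier to read."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)


def flesch_kincaid_grade(n_sentences: int, n_words: int, n_syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula; approximates a US school grade."""
    return 0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59


def readability(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for a text, using naive sentence and word splitting."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    counts = (len(sentences), len(words), syllables)
    return flesch_reading_ease(*counts), flesch_kincaid_grade(*counts)


if __name__ == "__main__":
    sample = "Orthodontic treatment may begin once periodontal inflammation is controlled."
    fre, fkgl = readability(sample)
    print(f"FRE: {fre:.2f}, FKGL: {fkgl:.2f}")
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences dense with clinical terminology push FRE down and FKGL up.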