What does this research mean for the field?

Contemporary large language models demonstrate moderate-to-good proficiency in answering prosthodontic patient queries, with comparable scientific accuracy across platforms but significant variations in comprehensiveness, necessitating professional oversight. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

The study assesses the effectiveness of six large language models in providing informative responses to prosthodontic patient queries.

May 29, 2026 | Open Access

Comparative efficacy of six contemporary large language models in prosthodontic patient education: a multi-parameter blinded assessment

Key Points

Descriptive, cross-sectional comparative study design
Ten standardized patient questions posed to six LLMs
Responses evaluated by two blinded prosthodontic specialists using a 5-point Likert scale
Claude 3.7 Sonnet achieved the highest overall mean score (3.79 ± 0.74), followed by Gemini 2.0 (3.75 ± 0.58).
Comprehensiveness varied significantly across models (p = 0.036): Gemini 2.0 (3.90 ± 0.66) was rated more comprehensive than Microsoft Copilot (2.95 ± 0.55).
No model attained 'very good' ratings (≥ 4.5) across all parameters, indicating room for improvement.

Abstract

Background: This study aimed to evaluate and benchmark the proficiency of six major large language models (LLMs), specifically the versions publicly available as of March 2025 (ChatGPT-4o, Claude 3.7 Sonnet, Microsoft Copilot, DeepSeek-V2, Gemini 2.0, and Grok-3), in addressing prosthodontic patient inquiries across various clinical dimensions, including scientific accuracy, comprehensiveness, clarity, and relevance.

Methods: In this descriptive, cross-sectional comparative study, ten standardized patient questions encompassing key prosthodontic topics (fixed prostheses, removable prostheses, dental implants, and aesthetic restorations) were systematically posed to six current LLM platforms. Each model was assigned a prosthodontic specialist persona through standardized prompts. A total of 60 independent responses (representing unique combinations of the 10 questions and 6 models) were obtained. These were evaluated by two experienced prosthodontic specialists (one Professor with 15+ years, one Associate Professor with 8+ years of clinical and academic experience) using a validated 5-point Likert scale. The evaluation was strictly double-blinded; assessors were completely unaware of both the model identities and each other’s ratings. Statistical analyses included Intraclass Correlation Coefficient (ICC) for inter-rater reliability, Cronbach’s alpha for internal consistency, and Kruskal-Wallis H test for performance comparison, with significance set at p < 0.05.

Results: Claude 3.7 Sonnet achieved the highest overall mean score (3.79 ± 0.74), followed by Gemini 2.0 (3.75 ± 0.58) and Grok-3 (3.70 ± 0.54). ChatGPT-4o, DeepSeek-V2, and Microsoft Copilot demonstrated varying performance levels (3.40 ± 0.64, 3.68 ± 0.46, and 3.29 ± 0.53, respectively). While no statistically significant differences were observed among models for scientific accuracy (p = 0.320), clarity (p = 0.184), or relevance (p = 0.608), comprehensiveness showed significant variation (p = 0.036). Gemini 2.0 provided significantly more comprehensive responses (3.90 ± 0.66) compared to Microsoft Copilot (2.95 ± 0.55). Inter-rater reliability was good (ICC = 0.709, 95% confidence interval [CI]: 0.550–0.846, p < 0.001), as was internal consistency (Cronbach’s α = 0.709). Critically, none of the evaluated models achieved ratings in the “very good” category (≥ 4.5) across all parameters.

Conclusion: Contemporary LLMs demonstrate moderate-to-good proficiency in addressing prosthodontic patient education queries, with Claude 3.7 Sonnet and Gemini 2.0 currently offering the most balanced performance profiles. While scientific accuracy is comparable across platforms, significant variations exist in information comprehensiveness. Hallucinations remain an inherent risk across all models, and the absence of “very good” ratings indicates substantial limitations that necessitate professional oversight. Future assessments should include qualitative error analysis and layman evaluations to better capture patient perspectives.
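The abstract names the Kruskal-Wallis H test as the method for comparing models on each parameter. As a rough illustration of what that test computes, the sketch below ranks pooled ratings, sums the ranks per group, and applies the standard tie correction (which matters for Likert data, where ties are common). This is not the authors' analysis code: the function name `kruskal_wallis_h` and the example ratings are hypothetical, and the p-value step (comparing H against a chi-square distribution with k − 1 degrees of freedom) is left out to keep the example dependency-free.

```python
from collections import Counter


def kruskal_wallis_h(groups):
    """Return (H, degrees_of_freedom) using the standard tie correction.

    `groups` is a list of lists of numeric ratings, one list per model.
    """
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    counts = Counter(pooled)
    if len(counts) < 2:
        raise ValueError("All ratings are identical; H is undefined.")

    # Give each distinct value the average of the ranks it occupies.
    rank_of = {}
    position = 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = position + (t + 1) / 2  # mean of ranks position+1..position+t
        position += t

    # H = 12 / (n (n + 1)) * sum(R_j^2 / n_j) - 3 (n + 1)
    rank_sum_term = sum(
        sum(rank_of[v] for v in group) ** 2 / len(group) for group in groups
    )
    h = 12.0 / (n * (n + 1)) * rank_sum_term - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    correction = 1 - sum(t**3 - t for t in counts.values()) / (n**3 - n)
    return h / correction, len(groups) - 1


if __name__ == "__main__":
    # Made-up 5-point Likert ratings for three hypothetical models;
    # these are NOT data from the study.
    example_ratings = [
        [4, 4, 3, 5, 4],
        [3, 3, 4, 3, 2],
        [4, 3, 4, 4, 3],
    ]
    h_stat, dof = kruskal_wallis_h(example_ratings)
    print(f"H = {h_stat:.3f}, df = {dof}")
```

With six models, as in the study, the test would have 5 degrees of freedom; a statistics library would normally be used to turn H into the reported p-values.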
