Abstract

Background
This study aimed to evaluate and benchmark the proficiency of six major large language models (LLMs), specifically the versions publicly available as of March 2025 (ChatGPT-4o, Claude 3.7 Sonnet, Microsoft Copilot, DeepSeek-V2, Gemini 2.0, and Grok-3), in answering prosthodontic patient inquiries across the dimensions of scientific accuracy, comprehensiveness, clarity, and relevance.

Methods
In this descriptive, cross-sectional comparative study, ten standardized patient questions covering key prosthodontic topics (fixed prostheses, removable prostheses, dental implants, and aesthetic restorations) were systematically posed to the six LLM platforms. Each model was assigned a prosthodontic specialist persona through standardized prompts. A total of 60 independent responses (10 questions × 6 models) were obtained. These were evaluated by two experienced prosthodontic specialists (one Professor with 15+ years and one Associate Professor with 8+ years of clinical and academic experience) using a validated 5-point Likert scale. The evaluation was strictly double-blinded: assessors were unaware of both the model identities and each other's ratings. Statistical analyses included the Intraclass Correlation Coefficient (ICC) for inter-rater reliability, Cronbach's alpha for internal consistency, and the Kruskal-Wallis H test for performance comparison, with significance set at p < 0.05.

Results
Claude 3.7 Sonnet achieved the highest overall mean score (3.79 ± 0.74), followed by Gemini 2.0 (3.75 ± 0.58), Grok-3 (3.70 ± 0.54), DeepSeek-V2 (3.68 ± 0.46), ChatGPT-4o (3.40 ± 0.64), and Microsoft Copilot (3.29 ± 0.53). No statistically significant differences were observed among models for scientific accuracy (p = 0.320), clarity (p = 0.184), or relevance (p = 0.608), whereas comprehensiveness showed significant variation (p = 0.036): Gemini 2.0 provided significantly more comprehensive responses (3.90 ± 0.66) than Microsoft Copilot (2.95 ± 0.55). Inter-rater reliability was good (ICC = 0.709, 95% confidence interval [CI]: 0.550–0.846, p < 0.001), as was internal consistency (Cronbach's α = 0.709). Critically, none of the evaluated models achieved ratings in the "very good" category (≥ 4.5) across all parameters.

Conclusion
Contemporary LLMs demonstrate moderate-to-good proficiency in addressing prosthodontic patient education queries, with Claude 3.7 Sonnet and Gemini 2.0 currently offering the most balanced performance profiles. While scientific accuracy is comparable across platforms, information comprehensiveness varies significantly. Hallucinations remain an inherent risk across all models, and the absence of "very good" ratings indicates substantial limitations that necessitate professional oversight. Future assessments should include qualitative error analysis and layperson evaluations to better capture patient perspectives.
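For readers unfamiliar with two of the statistics named in the Methods, the following is a minimal Python sketch of the tie-corrected Kruskal-Wallis H statistic and of Cronbach's alpha, using only the standard library. It is an illustration under stated assumptions, not the authors' analysis code: the function names (`average_ranks`, `kruskal_wallis_h`, `cronbach_alpha`) are hypothetical, the abstract does not say which software was used, and no study data are reproduced here.

```python
from collections import Counter
from statistics import variance


def average_ranks(values):
    """Return 1-based ranks for `values`, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H for a list of score lists (one per model)."""
    pooled = [score for group in groups for score in group]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    weighted_rank_sums = 0.0
    offset = 0
    for group in groups:
        rank_sum = sum(ranks[offset:offset + len(group)])
        weighted_rank_sums += rank_sum ** 2 / len(group)
        offset += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * weighted_rank_sums - 3 * (n_total + 1)

    # Likert ratings are heavily tied, so apply the standard tie correction.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    correction = 1 - tie_term / (n_total ** 3 - n_total)
    if correction == 0:
        raise ValueError("all scores are identical; H is undefined")
    return h / correction


def cronbach_alpha(items):
    """Cronbach's alpha; `items` is a list of equally long score columns."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]  # per-response total score
    item_variance = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance / variance(totals))
```

As a made-up check, `kruskal_wallis_h([[1, 2, 3], [4, 5, 6], [7, 8, 9]])` evaluates to approximately 7.2. Under the null hypothesis, H is compared against a chi-square distribution with k − 1 degrees of freedom; for six models that is 5 degrees of freedom, where the 0.05 critical value is about 11.07. With SciPy available, `scipy.stats.kruskal` returns both the statistic and the p-value directly.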
Işık et al. studied this question.