What question did this study set out to answer?

The study aims to evaluate the accuracy of AI language models in delivering information about impacted teeth.

January 22, 2026

Clinical Accuracy of AI Language Models in Providing Impacted Teeth Information: A Comparative Evaluation

Key Points

The study aims to evaluate the accuracy of AI language models in delivering information about impacted teeth.
The researchers compared three AI models: ChatGPT-4, Gemini, and Copilot.
Each model was asked the same 118 expert-generated open-ended questions.
Responses were categorized into five predefined accuracy levels.
Accuracy was compared using the Pearson χ² or Fisher exact test (P ≤ 0.05).
For ChatGPT-4, 83.9% of responses were rated "Objectively True," a higher proportion than for either Gemini or Copilot.
Gemini and Copilot provided more incomplete or selectively accurate responses classified as "Selected Facts" or "Minimal Facts."
Overall, ChatGPT-4 was identified as the most reliable of the three models for information on impacted teeth.

Abstract

Artificial intelligence (AI) language models are increasingly integrated into clinical and patient-centered information pathways, yet their accuracy in delivering condition-specific dental knowledge remains unclear. This comparative study evaluated the clinical accuracy of 3 widely used AI models (ChatGPT-4, Gemini, and Copilot) in providing information on impacted teeth. A total of 118 expert-generated open-ended questions were posed to each model, and responses were categorized into 5 predefined accuracy levels. Statistical analysis using the Pearson χ² or Fisher exact test (P ≤ 0.05) demonstrated that ChatGPT-4 produced the highest proportion of “Objectively True” responses (83.9%) and consistently outperformed Gemini and Copilot across all domains, including definitions, indications, procedural descriptions, contraindications, and complications. Gemini and Copilot more frequently generated incomplete or selectively accurate answers classified as “Selected Facts” or “Minimal Facts,” highlighting variability in their informational reliability. Overall, ChatGPT-4 exhibited superior clinical accuracy and appears to function as a more dependable supplementary resource for impacted tooth–related information, whereas the inconsistent performance of Gemini and Copilot underscores the continued need for expert oversight in patient education and clinical communication.
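
To make the statistical comparison concrete, the following is a minimal Python sketch of a Pearson χ² test on a 3 × 2 table (model × "Objectively True" vs. any other level). It is an illustration under stated assumptions, not the authors' analysis. The ChatGPT-4 count of 99 follows from the reported 83.9% of 118 questions; the Gemini and Copilot counts are hypothetical placeholders, because the summary does not report them. Collapsing the five accuracy levels into two, and flagging the Fisher exact test when any expected count falls below 5, are also assumptions (the latter is a common rule of thumb, not something the summary states).

```python
import math


def pearson_chi_square(table):
    """Return (statistic, degrees of freedom, expected counts) for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Expected count under independence: row total * column total / grand total.
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    stat = sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof, expected


def p_value_two_dof(stat):
    """Upper-tail p-value for a chi-square statistic with exactly 2 degrees of
    freedom, where the survival function reduces to exp(-x / 2)."""
    return math.exp(-stat / 2)


N_QUESTIONS = 118

# Derived from the summary: 83.9% of 118 responses is 99 "Objectively True".
chatgpt_true = round(0.839 * N_QUESTIONS)

# HYPOTHETICAL placeholder counts; the summary gives no figures for these models.
gemini_true = 70
copilot_true = 65

# One row per model: [objectively true, any other accuracy level].
table = [[n, N_QUESTIONS - n] for n in (chatgpt_true, gemini_true, copilot_true)]

stat, dof, expected = pearson_chi_square(table)

# Assumed rule of thumb: prefer an exact test when any expected count is below 5.
use_fisher = any(e < 5 for row in expected for e in row)

p = p_value_two_dof(stat)  # valid only because dof == 2 for a 3 x 2 table
print(f"chi2 = {stat:.2f}, dof = {dof}, p = {p:.2g}, use Fisher instead: {use_fisher}")
```

The closed-form p-value applies only to the 2-degrees-of-freedom case shown here; a full 3 × 5 table over all accuracy levels would need a general χ² distribution function from a statistics library.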
