What question did this study set out to answer?

Evaluate and compare the performance of AI chatbots in answering queries about oral lesions.

April 1, 2026 · Open Access

Performance Comparison of Artificial Intelligence Chatbots in Addressing Oral Lesion Queries: An In Silico Cross-Sectional Study

Key Points

The study evaluated and compared three AI chatbots (ChatGPT, Google Gemini, and Microsoft Copilot) in answering queries about oral lesions.
Twenty patient-centered questions were curated from reputable health sources and public forums.
Each question was entered into the three chatbots under standardized conditions.
Four calibrated observers independently rated the responses on a structured five-point Likert scale.
Readability was analyzed with the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) indices.
Google Gemini and ChatGPT outperformed Microsoft Copilot in accuracy and safety.
Significant differences in accuracy (P = 0.022) and safety (P < 0.001) were noted.
ChatGPT had the best readability (FKGL = 6.58, FRE = 59.64).
Inter-rater agreement was highest for Microsoft Copilot (κ ≈ 0.8).
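
The summary reports inter-rater agreement as κ but does not say which kappa statistic was used. As an illustration only, the sketch below assumes Fleiss' kappa, a common choice when more than two raters score every item on a categorical scale. The function name `fleiss_kappa` and the count-table input layout are hypothetical choices for this example, not the authors' procedure.

```python
def fleiss_kappa(counts: list[list[int]]) -> float:
    """Fleiss' kappa for a table of rating counts.

    counts[i][j] is the number of raters who gave item i the category j
    (for example, j = 0..4 for a five-point Likert scale). Every item must
    be rated by the same number of raters. Assumed statistic: the source
    does not name the kappa variant it used.
    """
    if not counts:
        raise ValueError("at least one rated item is required")
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item must have the same number of ratings")

    n_categories = len(counts[0])
    total = n_items * n_raters

    # Share of all ratings that fell into each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]

    # Observed agreement per item: fraction of rater pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]

    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical data: two responses, four raters, five Likert categories.
example = [
    [0, 0, 0, 0, 4],  # all four raters chose the highest score
    [4, 0, 0, 0, 0],  # all four raters chose the lowest score
]
kappa = fleiss_kappa(example)
```

With complete agreement on every item, as in the hypothetical table above, the statistic equals 1; values near 0 indicate agreement no better than chance.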

Abstract

Background: Oral lesions are common clinical findings that frequently cause patient anxiety and prompt individuals to seek online information. Artificial intelligence (AI)-driven chatbots, such as ChatGPT, Google Gemini, and Microsoft Copilot, are increasingly utilized for immediate guidance; however, their reliability, accuracy, and safety in addressing oral lesion-related queries remain uncertain.

Objective: This study aimed to evaluate and compare the performance of ChatGPT, Google Gemini, and Microsoft Copilot in responding to patient queries on oral lesions, with emphasis on accuracy, relevance, clarity, safety, transparency, and readability.

Methods: Twenty patient-centered questions were curated from reputable health sources and public forums. Each question was entered into the three chatbots under standardized conditions. Four calibrated observers independently rated the responses using a structured five-point Likert scale. Readability was analyzed using the Flesch Reading Ease (FRE) and Flesch–Kincaid Grade Level (FKGL) indices.

Results: Google Gemini and ChatGPT outperformed Microsoft Copilot, with significant differences observed in accuracy (P = 0.022) and safety (P < 0.001). Inter-rater agreement was highest for Copilot (κ ≈ 0.8), while ChatGPT demonstrated the best readability (FKGL = 6.58, FRE = 59.64).

Conclusion: ChatGPT and Google Gemini demonstrated superior performance compared to Microsoft Copilot. While ChatGPT offered more readable responses, Gemini provided more comprehensive but complex content. Continuous refinement and domain-specific training are essential to enhance their clinical reliability and ensure patient safety.
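
For readers unfamiliar with the two readability indices, the sketch below applies the standard published formulas: FRE = 206.835 − 1.015 × (words per sentence) − 84.6 × (syllables per word), and FKGL = 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The tokenization and the vowel-group syllable counter are rough assumptions made for this example, and the function names are hypothetical. The study does not state which tool it used, so scores from this sketch may differ from the reported values.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, discounting a silent final 'e'.

    This is a heuristic assumption; dedicated readability tools use more
    careful rules or pronunciation dictionaries.
    """
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level) for text."""
    # Naive splitting: sentences end at ., ! or ?; words are letter runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)

    fre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fre, fkgl


# Hypothetical chatbot reply, used only to exercise the functions.
sample = "Most mouth ulcers heal within two weeks. See a dentist if one lasts longer."
fre_score, fkgl_score = readability(sample)
```

Higher FRE means easier text, while FKGL approximates the US school grade needed to follow it, so the reported FKGL of 6.58 for ChatGPT corresponds to roughly a sixth- to seventh-grade reading level.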
