October 28, 2025. Open Access

Comparison of accuracy and consistency of AI Language models when answering standardised dental MCQs

Abstract

BACKGROUND: Artificial intelligence (AI) models are increasingly integrated into dental education for assessment and learning support. However, their accuracy and reliability in assessing dental knowledge require further evaluation.

OBJECTIVE: This study aimed to assess and compare the accuracy and response consistency of five AI language models (ChatGPT-4, Grok XI, Gemini, Qwen 2.5 and DeepSeek-V3) using standardised dental multiple-choice questions (MCQs).

METHODS: A set of 150 MCQs from two textbooks was used. Each AI model was tested twice, 10 days apart, using identical questions. Accuracy was determined by comparing responses with reference answers, and consistency was measured using Cohen's kappa and McNemar's test. Inter-model agreement was also analysed.

RESULTS: ChatGPT-4 showed the highest accuracy (91.3% in both assessments), followed by Grok XI (90.7-92.7%) and Qwen 2.5 (89.3%). Gemini and DeepSeek-V3 scored slightly lower (86.7-88.7%). ChatGPT-4, Grok XI and Gemini demonstrated strong consistency, whereas Qwen 2.5 and DeepSeek-V3 exhibited more variation between test administrations. No significant differences were found in inter-model agreement (p > 0.05).

CONCLUSION: All five AI models showed high levels of accuracy in answering dental MCQs, and three of them (ChatGPT-4, Grok XI and Gemini) had strong test-retest reliability. These AI models show promise as educational tools, though continued evaluation and refinement are needed for broader clinical or academic applications.
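
For readers unfamiliar with the two consistency measures named in the Methods, the following is a minimal Python sketch of how Cohen's kappa and McNemar's test could be computed for one model's two test administrations. It is not the authors' analysis code. The function names, the choice of the exact (binomial) form of McNemar's test, the use of chosen option letters as kappa categories, and the toy answer lists are all assumptions made for illustration; none of the data below comes from the study.

```python
from collections import Counter
from math import comb


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of items with the same label in both runs.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal label frequencies.
    freq1, freq2 = Counter(first), Counter(second)
    expected = sum(freq1[k] * freq2[k] for k in freq1) / (n * n)
    if expected == 1.0:
        # Both runs used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def mcnemar_exact(correct1, correct2):
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    # Only discordant pairs (correct in one run, wrong in the other) matter.
    b = sum(x and not y for x, y in zip(correct1, correct2))
    c = sum(y and not x for x, y in zip(correct1, correct2))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a shift
    # Under the null, discordant pairs split like fair coin flips.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


if __name__ == "__main__":
    # Hypothetical toy data: an answer key and one model's two runs.
    key = list("ABCDABCD")
    run1 = list("ABCDABCA")
    run2 = list("ABCDABDD")
    ok1 = [r == k for r, k in zip(run1, key)]
    ok2 = [r == k for r, k in zip(run2, key)]
    print("accuracy run 1:", sum(ok1) / len(key))
    print("accuracy run 2:", sum(ok2) / len(key))
    print("kappa between runs:", cohen_kappa(run1, run2))
    print("McNemar exact p:", mcnemar_exact(ok1, ok2))
```

In this sketch, kappa compares the option chosen in each run, while McNemar's test compares only whether each answer was correct, so the two measures capture different aspects of test-retest consistency.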
