Large language models (LLMs) are increasingly explored as clinical decision-support tools in dentistry; however, their reliability in endodontic decision-making remains uncertain. Most previous studies have relied on single-response accuracy, which provides limited insight into the reproducibility and stability of LLM outputs across repeated evaluations. This study aimed to evaluate LLM performance in endodontic case difficulty assessment using the American Association of Endodontists (AAE) framework, with emphasis on accuracy, consistency, temporal stability, and inter-model agreement. Thirty standardised endodontic cases (10 low, 10 moderate, 10 high difficulty) were developed according to AAE criteria and presented in three linguistically distinct but clinically equivalent formats (n = 90 test items). Two LLMs were each queried independently 30 times per item, yielding 5,400 classifications in a repeated-measures design. Accuracy was defined as agreement with expert-derived ground truth. Consistency and temporal stability were assessed across repeated queries, and inter-model agreement was evaluated using Cohen’s kappa statistics. Both models demonstrated a clear difficulty-dependent performance pattern. Accuracy and consistency were highest in low-difficulty cases, whereas moderate-difficulty cases showed the lowest accuracy and greatest variability. High-difficulty cases exhibited the highest response stability despite differences in accuracy between models. Linguistic variation had minimal impact on performance. Inter-model agreement was substantial, with disagreements largely confined to adjacent difficulty categories. LLM performance in endodontic case difficulty assessment is influenced more by the clarity of clinical features than by case complexity alone. Moderate-difficulty cases represent a critical zone of decision instability, where reliance on single-response accuracy may be misleading.
These findings support the need for uncertainty-aware evaluation frameworks and reinforce the role of LLMs as adjunctive clinical decision-support tools.
Çatmabacak et al. (Sat,) studied this question.
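The repeated-measures design and agreement metrics summarised above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: consistency is taken here as the share of repeated responses matching the most frequent label for an item, and inter-model agreement is computed as unweighted Cohen's kappa (the study may have used a weighted variant or a different consistency definition). All function names are hypothetical, and the example ratings are toy values, not study data.

```python
from collections import Counter

# AAE difficulty categories used as classification labels.
LABELS = ("low", "moderate", "high")


def total_classifications(cases=30, formats=3, repeats=30, models=2):
    """Number of classifications implied by the design in the abstract."""
    return cases * formats * repeats * models


def accuracy(responses, truth):
    """Fraction of repeated responses that match the expert-derived label."""
    return sum(r == truth for r in responses) / len(responses)


def modal_consistency(responses):
    """Assumed consistency metric: share of responses equal to the modal label."""
    counts = Counter(responses)
    return counts.most_common(1)[0][1] / len(responses)


def cohen_kappa(ratings_a, ratings_b, labels=LABELS):
    """Unweighted Cohen's kappa between two raters over the same items."""
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("rating lists must be non-empty and of equal length")
    # Observed agreement: proportion of items with identical labels.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    print(total_classifications())
    # Toy item: 30 repeated answers for one moderate-difficulty case.
    toy = ["moderate"] * 24 + ["high"] * 6
    print(accuracy(toy, "moderate"), modal_consistency(toy))
    # Toy inter-model comparison over four items.
    model_a = ["low", "low", "moderate", "high"]
    model_b = ["low", "moderate", "moderate", "high"]
    print(cohen_kappa(model_a, model_b))
```

Reporting accuracy and modal consistency separately matters because the two can diverge: a model that repeatedly gives the same wrong label for an item scores high on consistency and low on accuracy, which a single-response evaluation cannot reveal.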