What question did this study set out to answer?

This study aims to evaluate the performance of large language models in assessing endodontic case difficulty using established criteria.

April 13, 2026 | Open Access

Evaluating large language models for endodontic case difficulty assessment: accuracy, consistency, temporal stability, and agreement

Key points

Thirty standardized endodontic cases were developed and categorized by difficulty (low, moderate, high).
Each case was presented in three linguistically distinct but clinically equivalent formats, giving 90 test items.
Two LLMs were each queried independently 30 times per item, yielding 5,400 classifications for analysis.
Accuracy was measured against expert-derived ground truth; consistency and temporal stability were assessed across repeated queries.
Inter-model agreement was evaluated using Cohen’s kappa statistics.
Both models showed a clear difficulty-dependent performance pattern.
Accuracy and consistency were highest in low-difficulty cases.
Moderate-difficulty cases had the lowest accuracy and the greatest variability.
High-difficulty cases showed the highest response stability, although accuracy differed between the two models.
Inter-model agreement was substantial, with disagreements largely confined to adjacent difficulty categories.

Abstract

Large language models (LLMs) are increasingly explored as clinical decision-support tools in dentistry; however, their reliability in endodontic decision-making remains uncertain. Most previous studies have relied on single-response accuracy, providing limited insight into the reproducibility and stability of LLM outputs across repeated evaluations. This study aimed to evaluate LLM performance in endodontic case difficulty assessment using the American Association of Endodontists (AAE) framework, with emphasis on accuracy, consistency, temporal stability, and inter-model agreement. Thirty standardised endodontic cases (10 low, 10 moderate, 10 high difficulty) were developed according to AAE criteria and presented in three linguistically distinct but clinically equivalent formats (n = 90 test items). Two LLMs were queried independently 30 times per item, yielding 5,400 classifications in a repeated-measures design. Accuracy was defined as agreement with expert-derived ground truth. Consistency and temporal stability were assessed across repeated queries, and inter-model agreement was evaluated using Cohen’s kappa statistics. Both models demonstrated a clear difficulty-dependent performance pattern. Accuracy and consistency were highest in low-difficulty cases, whereas moderate-difficulty cases showed the lowest accuracy and greatest variability. High-difficulty cases exhibited the highest response stability despite differences in accuracy between models. Linguistic variation had minimal impact on performance. Inter-model agreement was substantial, with disagreements largely confined to adjacent difficulty categories. LLM performance in endodontic case difficulty assessment is influenced more by the clarity of clinical features than by case complexity alone. Moderate-difficulty cases represent a critical zone of decision instability, where reliance on single-response accuracy may be misleading. These findings support the need for uncertainty-aware evaluation frameworks and reinforce the role of LLMs as adjunctive clinical decision-support tools.
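As an illustration of the metrics named in the abstract, the following is a minimal Python sketch, not the authors' analysis code. It checks the design arithmetic (30 cases × 3 formats × 2 models × 30 queries) and computes per-item accuracy, a simple consistency score, and Cohen's kappa. The abstract does not say how consistency was operationalised or whether kappa was weighted, so the modal-agreement definition and the unweighted kappa used here are assumptions, and all function names and the toy labels are hypothetical.

```python
from collections import Counter

LABELS = ("low", "moderate", "high")

# Design arithmetic as reported in the abstract.
CASES, FORMATS, MODELS, REPEATS = 30, 3, 2, 30
TEST_ITEMS = CASES * FORMATS                    # 90 test items
TOTAL_CLASSIFICATIONS = TEST_ITEMS * MODELS * REPEATS  # 5,400 classifications


def accuracy(responses, truth):
    """Fraction of repeated responses that match the expert-derived label."""
    return sum(r == truth for r in responses) / len(responses)


def modal_consistency(responses):
    """Share of responses equal to the most frequent label.

    Assumed definition of consistency; the abstract does not specify one.
    """
    return Counter(responses).most_common(1)[0][1] / len(responses)


def cohens_kappa(a, b, labels=LABELS):
    """Unweighted Cohen's kappa for two raters' paired labels (assumed variant)."""
    if len(a) != len(b) or not a:
        raise ValueError("a and b must be non-empty and of equal length")
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[l] * count_b[l] for l in labels) / (n * n)
    if p_expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


if __name__ == "__main__":
    # Toy data only: 30 repeated answers for one low-difficulty item.
    repeated = ["low"] * 27 + ["moderate"] * 3
    print(accuracy(repeated, "low"), modal_consistency(repeated))

    # Toy paired labels from two hypothetical models on six items.
    model_a = ["low", "low", "moderate", "moderate", "high", "high"]
    model_b = ["low", "low", "moderate", "high", "high", "high"]
    print(round(cohens_kappa(model_a, model_b), 3))
```

In the toy pairing the single disagreement falls between adjacent categories (moderate versus high), the kind of disagreement the abstract reports as most common. Because difficulty is an ordinal scale, a weighted kappa could also be a reasonable choice; the summary does not state which variant the study used.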
