February 19, 2026 · Open Access

Operating characteristics of agreement metrics in AI-based scoring: a Monte Carlo simulation

Abstract

Introduction: This study examined the threshold-exceedance performance of human-AI agreement measures in the scoring of open-ended items. A Monte Carlo simulation was designed to represent the types of error encountered in automated scoring, as identified from studies in the literature. Agreement between human and AI ratings was then examined under differing error conditions. The objective was to ascertain the statistical power levels achieved by the different agreement metrics under varying conditions, and to determine which metric is most effective under which conditions.
Methods: Data were simulated under conditions including systematic additive bias, variance inflation, midpoint compression, class imbalance, and subgroup-related offsets. Human scores served as the reference for assessing agreement. Agreement was evaluated with ICC(A,1), Krippendorff's α (ranked), quadratic weighted kappa (QWK), and Bland-Altman analysis, along with tolerance-based agreement metrics. Threshold-exceedance performance was defined as the proportion of simulation runs in which each metric surpassed conventional adequacy standards. In the second part of the study, analyses were conducted on real data to validate the simulation results. In this part, texts written by six students were scored by three teachers and two large language models.
Results: ICC(A,1) showed higher threshold-exceedance performance under low and moderate variance inflation. QWK reached a moderate level of robustness. Krippendorff's α performed consistently, especially in conditions where the distributions were unbalanced or the variance was inflated. Tolerance-based agreement demonstrated numerical closeness between human and AI scores. The real-data findings showed patterns consistent with the simulated impairments.
Discussion: The findings indicate that agreement indices vary systematically across structural mechanisms and sampling conditions, and suggest that these conditions can affect the interpretability of automated scores. They highlight the need for multi-metric frameworks when assessing human-AI score agreement.
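
To make the notion of threshold-exceedance performance concrete, the following is a minimal Python sketch of how such a rate could be estimated for QWK under systematic additive bias. It is not the authors' simulation design: the 0-4 score scale, the latent-trait data generator, the noise levels, the sample size, the number of replications, and the 0.70 adequacy threshold are all illustrative assumptions, and the function names `quadratic_weighted_kappa` and `exceedance_rate` are hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(a, b, n_cat):
    """QWK for two integer score vectors on the scale 0..n_cat-1."""
    # Observed cross-tabulation of rater A (rows) against rater B (columns).
    observed = np.zeros((n_cat, n_cat))
    np.add.at(observed, (a, b), 1)
    # Quadratic disagreement weights, scaled to [0, 1].
    idx = np.arange(n_cat)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_cat - 1) ** 2
    # Cross-tabulation expected under independence of the two raters.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    return 1.0 - (weights * observed).sum() / (weights * expected).sum()


def exceedance_rate(bias, n_reps=500, n=200, n_cat=5, threshold=0.70, seed=0):
    """Share of replications in which QWK reaches the threshold.

    All defaults are assumptions made for illustration only.
    """
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_reps):
        # Assumed generator: a shared latent trait plus independent rater noise.
        latent = rng.normal(2.0, 1.0, size=n)
        human = latent + rng.normal(0.0, 0.5, size=n)
        # The AI rater carries the systematic additive bias.
        ai = latent + bias + rng.normal(0.0, 0.5, size=n)
        # Round to the nearest category and clip to the score scale.
        human = np.clip(np.rint(human), 0, n_cat - 1).astype(int)
        ai = np.clip(np.rint(ai), 0, n_cat - 1).astype(int)
        if quadratic_weighted_kappa(human, ai, n_cat) >= threshold:
            hits += 1
    return hits / n_reps


if __name__ == "__main__":
    for bias in (0.0, 0.5, 1.0):
        print(f"bias={bias:.1f}  exceedance rate={exceedance_rate(bias):.3f}")
```

The same loop could be repeated with another agreement metric in place of `quadratic_weighted_kappa`, or with a different impairment (for example, inflating the AI noise variance instead of adding a bias), to compare how the exceedance rates diverge across metrics and conditions.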
