Building on cognitive theories, this study investigates interrater reliability (IRR) and interrater agreement (IRA) between human raters and artificial intelligence (AI) models (ChatGPT-4 and Gemini) in scoring open-ended (constructed response) and single best answer questions. The study involved 73 third-year undergraduate students from the Department of Primary School Education, enrolled in the Measurement and Evaluation course. Of these, 46 students answered the open-ended question, and 69 answered the single best answer question. The analysis focused on comparing IRR and IRA between human raters and AI models, as well as between the AI models themselves. Results indicated that AI models demonstrated higher IRR and IRA than human raters for the open-ended question, characterized by less structured responses and multidimensional scoring criteria, while human raters performed slightly better for the single best answer question. AI models showed relatively consistent scoring performance across task types, whereas human raters exhibited greater variability, particularly for the open-ended task. These findings highlight task-dependent differences in scoring consistency that are interpretable through cognitive theories, contributing to ongoing debates about the appropriate role of AI in educational assessment.
Güvendir et al. studied this question.
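
To make the comparison of rater pairs more concrete, the following is a minimal sketch of how agreement between two raters (for example, a human rater and an AI model) might be quantified. It is an illustration under stated assumptions, not the study's procedure: the abstract does not say which statistics were used, so percent exact agreement and unweighted Cohen's kappa are shown only as two commonly used indices. The function names and the score lists are hypothetical.

```python
from collections import Counter


def percent_agreement(rater_a, rater_b):
    """Share of responses to which both raters gave exactly the same score."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("Both raters must score the same non-empty set of responses.")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(rater_a)
    p_observed = percent_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability that both raters independently pick the same category.
    categories = counts_a.keys() | counts_b.keys()
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        # Both raters used a single identical category; kappa is undefined here.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical scores on a 0-3 rubric for eight responses (not data from the study).
human_scores = [2, 3, 1, 0, 2, 3, 1, 2]
ai_scores = [2, 3, 1, 1, 2, 2, 1, 2]

print("Exact agreement:", percent_agreement(human_scores, ai_scores))
print("Cohen's kappa:", round(cohen_kappa(human_scores, ai_scores), 3))
```

The same two functions could be applied to each pair of interest (human vs. human, human vs. AI, AI vs. AI) and to each task type separately. For ordinal rubric scores, a weighted kappa or an intraclass correlation may be more appropriate; those are omitted here to keep the sketch short.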