What question did this study set out to answer?

The research aims to compare interrater reliability and agreement between human raters and AI models in educational assessments.

March 21, 2026 · Open Access

Bridging minds and machines: a comparative study of AI and human rater agreement and reliability in educational assessment

Key Points

The study compared how human raters and AI models scored open-ended and single best answer questions.
Participants were 73 undergraduate students in a Measurement and Evaluation course.
The analysis compared interrater reliability and agreement between human raters and AI models, and between the AI models themselves.
AI models had higher interrater reliability and agreement for open-ended questions than human raters.
Human raters performed slightly better than AI models on the single best answer question.
AI models showed consistent scoring across task types, while human raters had greater variability, especially in open-ended tasks.

Abstract

Building on cognitive theories, this study investigates interrater reliability (IRR) and interrater agreement (IRA) between human raters and artificial intelligence (AI) models (ChatGPT-4 and Gemini) in scoring open-ended (constructed response) and single best answer questions. The study involved 73 third-year undergraduate students from the Department of Primary School Education, enrolled in the Measurement and Evaluation course. Of these, 46 students answered the open-ended question, and 69 answered the single best answer question. The analysis focused on comparing IRR and IRA between human raters and AI models, as well as between the AI models themselves. Results indicated that AI models demonstrated higher IRR and IRA than human raters for the open-ended question, characterized by less structured responses and multidimensional scoring criteria, while human raters performed slightly better for the single best answer question. AI models showed relatively consistent scoring performance across task types, whereas human raters exhibited greater variability, particularly for the open-ended task. These findings highlight task-dependent differences in scoring consistency that are interpretable through cognitive theories, contributing to ongoing debates about the appropriate role of AI in educational assessment.
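The distinction between IRR and IRA can be illustrated with a minimal sketch. The abstract does not say which statistics the study used, so the two indices below (exact percent agreement for IRA, Pearson correlation for IRR) are assumptions chosen for illustration, and the scores are made up rather than taken from the study.

```python
from statistics import mean


def percent_agreement(a, b):
    """Share of responses given the identical score by both raters (an IRA-style index)."""
    if len(a) != len(b) or not a:
        raise ValueError("score lists must be non-empty and equal in length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def pearson_r(a, b):
    """Pearson correlation between two raters' scores (a consistency-style IRR index)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("score lists must have equal length of at least 2")
    mean_a, mean_b = mean(a), mean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined when a rater gives constant scores")
    return cov / (var_a * var_b) ** 0.5


# Hypothetical scores: rater_2 is always exactly one point more lenient than rater_1.
rater_1 = [1, 2, 3, 4, 5]
rater_2 = [2, 3, 4, 5, 6]

agreement = percent_agreement(rater_1, rater_2)  # no score matches exactly
reliability = pearson_r(rater_1, rater_2)        # rank ordering is identical
print(f"agreement={agreement:.2f}, reliability={reliability:.2f}")
```

In this toy case the two raters never assign the same score, yet they order the responses identically, which is why reliability and agreement are typically reported as separate quantities.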


Cite This Study

Güvendir et al. studied this question.

synapsesocial.com/papers/69be34f26e48c4981c6731b9 https://doi.org/10.1007/s10639-026-13949-7
