June 14, 2026 | Open Access

Evaluating the Reliability and Agreement of Rubric-Guided LLM Scoring Versus Human Grading Across Three University Courses

Key Points

The study evaluates how closely rubric-guided large language model (LLM) scoring agrees with human grading in higher education.
The analysis covered 930 student responses from three courses, scored on a five-criterion rubric.
Scores from two human raters (H1, H2) and a large language model (ChatGPT) were compared.
Agreement was assessed with multiple statistical methods, including ICC, Pearson correlations, MAE, and Bland–Altman analysis.
Human–human (H1–H2) Total-score agreement was strong (ICC = 0.819).
Total-score agreement with the AI was ICC = 0.700 for AI–H1, 0.767 for AI–H2, and 0.763 for AI–human consensus (HC).
On a held-out test split, course-specific calibration improved Total-score ICC from 0.774 to 0.782 and reduced MAE from 1.624 to 1.215.
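
For readers unfamiliar with the agreement statistics named above, the following is a minimal sketch of how ICC(3,1), MAE, and Bland–Altman bias and limits of agreement can be computed for two sets of scores. It is not the study's analysis code: the function names (`icc_3_1`, `mae`, `bland_altman`) and the toy scores are hypothetical, and the 1.96 multiplier for the limits of agreement is the conventional choice, assumed here rather than taken from the study.

```python
import numpy as np

def icc_3_1(scores):
    """ICC(3,1): two-way mixed effects, consistency, single rater.

    `scores` is an (n_subjects, k_raters) array.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()
    # Two-way ANOVA sums of squares: subjects (rows), raters (columns), residual.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

def mae(a, b):
    """Mean absolute error between two score vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return np.mean(np.abs(a - b))

def bland_altman(a, b):
    """Return (bias, lower LoA, upper LoA) for the differences a - b."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical Total scores (0-15) for five responses; not data from the study.
h1 = np.array([10.0, 12.0, 8.0, 14.0, 11.0])
h2 = np.array([11.0, 12.0, 9.0, 13.0, 12.0])
ai = np.array([9.5, 12.5, 9.0, 12.5, 11.0])
hc = (h1 + h2) / 2  # human consensus: mean of the two human scores

print("H1-H2 ICC(3,1):", icc_3_1(np.column_stack([h1, h2])))
print("AI-HC ICC(3,1):", icc_3_1(np.column_stack([ai, hc])))
print("AI-HC MAE:", mae(ai, hc))
print("AI-HC bias and LoA:", bland_altman(ai, hc))
```

Because ICC(3,1) is a consistency coefficient, a rater who is uniformly stricter by a constant still yields a high ICC; the MAE and the Bland–Altman bias are what reveal that kind of systematic offset, which is one reason to report several metrics together.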

Abstract

Grading open-ended student work consistently is a persistent challenge in higher education, and the recent rise of large language models (LLMs) has renewed interest in rubric-guided automated scoring. However, a key gap remains: most studies report correlation rather than agreement, rarely benchmark models against a local human–human baseline, and seldom test whether simple post hoc calibration improves operational fit. This study addresses that gap by examining whether a rubric-guided LLM can approximate local human grading practice for text-based responses in three university courses, using agreement-oriented rather than correlation-only evidence. A total of 930 student responses from Prompt Engineering, Photoshop Design, and AI Video Production were scored by two human raters (H1, H2) and by ChatGPT (AI) using the same five-criterion analytic rubric (Accuracy, Logical Flow, Specificity, Quality, and Originality; 0.0–3.0 each; Total 0–15). Human consensus (HC) was defined as the mean of the two human scores and was treated as a pragmatic reference rather than a ground truth. Pairwise agreement among H1, H2, AI, and HC was evaluated using ICC(3,1), Pearson correlations, mean absolute error (MAE), and Bland–Altman bias and limits of agreement (LoA); a course-specific held-out calibration analysis was additionally conducted. For the Total score, human–human agreement was strong (ICC = 0.819 [0.797, 0.839]). AI–H1 and AI–H2 Total-score agreement were ICC = 0.700 [0.666, 0.732] and 0.767 [0.739, 0.792], respectively, while AI–HC agreement was ICC = 0.763 [0.735, 0.789], with MAE = 1.603 and LoA = [−4.246, 4.045]. At the trait level, AI–HC ICCs exceeded H1–H2 ICCs for all five rubric dimensions, although Quality remained weakly defined in the human baseline. On a 70/30 held-out test split, a course-specific linear calibration modestly improved Total-score ICC from 0.774 to 0.782 and reduced MAE from 1.624 to 1.215, narrowing the LoA from [−4.290, 4.188] to [−3.157, 3.329]. However, threshold-adjacent agreement remained imperfect after calibration. The principal contribution is a conservative, multi-metric agreement benchmark of rubric-guided LLM scoring against a local human baseline, together with a held-out calibration test that informs deployment. The findings concern written responses only and support a conservative conclusion: rubric-guided LLM scoring can assist human grading under fixed local rubrics, but the current evidence supports calibrated human–AI co-grading rather than unsupervised replacement.
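
The course-specific linear calibration on a 70/30 held-out split can be illustrated with a short sketch. This is one plausible reading of the procedure, not the authors' implementation: it assumes an ordinary least-squares line mapping AI Total scores to HC Total scores, fitted separately per course on a random 70% of responses and applied to the remaining 30%. The function name `calibrate_by_course`, the seed, and the synthetic scores are all hypothetical.

```python
import numpy as np

def calibrate_by_course(ai, hc, course, train_frac=0.7, seed=0):
    """Fit HC ~ intercept + slope * AI per course on a training split.

    Returns (calibrated, is_test): calibrated AI scores for held-out
    responses (NaN elsewhere) and a boolean mask marking the held-out rows.
    """
    ai = np.asarray(ai, dtype=float)
    hc = np.asarray(hc, dtype=float)
    course = np.asarray(course)
    rng = np.random.default_rng(seed)
    calibrated = np.full(ai.shape, np.nan)
    is_test = np.zeros(ai.shape, dtype=bool)
    for c in np.unique(course):
        # Shuffle this course's row indices, then split 70/30.
        idx = rng.permutation(np.flatnonzero(course == c))
        n_train = int(round(train_frac * len(idx)))
        train, test = idx[:n_train], idx[n_train:]
        # np.polyfit with deg=1 returns [slope, intercept].
        slope, intercept = np.polyfit(ai[train], hc[train], deg=1)
        calibrated[test] = intercept + slope * ai[test]
        is_test[test] = True
    return calibrated, is_test

# Synthetic example (not study data): the AI is systematically offset
# from the human consensus, with a different offset in each course.
rng = np.random.default_rng(42)
course = np.repeat(["course_a", "course_b"], 50)
hc = rng.uniform(4.0, 14.0, size=100)
offset = np.where(course == "course_a", 1.0, -1.5)
ai = hc + offset + rng.normal(0.0, 0.8, size=100)

cal, test = calibrate_by_course(ai, hc, course)
print("held-out MAE before calibration:", np.mean(np.abs(ai[test] - hc[test])))
print("held-out MAE after calibration: ", np.mean(np.abs(cal[test] - hc[test])))
```

A linear map of this kind can remove a systematic offset or scale difference between AI and human scores, but it cannot reduce response-level scatter, which is consistent with the abstract's report that MAE and the limits of agreement improved while the ICC changed only modestly.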
