May 29, 2026 · Open Access

Evaluating Open-Ended High-Stakes Examinations with LLMs: Alignment Between ChatGPT-4o and Human Grading in High- and Low-Resource Languages

Abstract

Large language models (LLMs) are increasingly used to grade written responses, yet large-scale benchmarks against human expert evaluation remain scarce, especially across languages with differing resource levels. This study evaluates ChatGPT-4o, used within a reranked retrieval-augmented generation framework, as a grader of open-ended responses from 1,016 students in Finland’s national high-stakes matriculation examination. We examined GPT-4o’s agreement with official grades, its recognition of grading-relevant keywords, and the effect of translating responses from a low-resource language (Finnish) into a high-resource language (HRL; English). Using descriptive statistics and correlation analyses, the results show that GPT-4o’s grades on a 0–15 scale closely matched human expert evaluations: 75.00% of scores were within ±2 points of official grades, and only 3.00% were severe outliers. Translating the responses into English improved this alignment to 85.00%. While the model generally identified relevant keywords effectively, occasional misinterpretations of contextual usage reduced grading reliability in a few cases. Overall, the findings demonstrate both the promise and the current limitations of LLM-based assessment. There is significant potential to use LLMs as supplementary grading tools, particularly in HRLs, but they do not yet match the consistency or interpretative depth of human expert evaluators. The study illustrates the need for human oversight, rigorous validation, and careful consideration of language effects when deploying LLMs in high-stakes educational assessments.
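
The agreement figures in the abstract (the share of scores within ±2 points of the official grade, plus a correlation) can be illustrated with a minimal sketch. This is not the authors' code: the function names `share_within` and `pearson` are hypothetical, the grade lists are made-up toy data on the 0–15 scale, and Pearson's r is assumed here only as one common choice, since the abstract does not say which correlation was used.

```python
import math


def share_within(model_grades, official_grades, tolerance=2):
    """Fraction of responses where the model grade is within
    `tolerance` points of the official grade (hypothetical helper)."""
    if len(model_grades) != len(official_grades) or not model_grades:
        raise ValueError("grade lists must be non-empty and of equal length")
    hits = sum(
        abs(m - o) <= tolerance
        for m, o in zip(model_grades, official_grades)
    )
    return hits / len(model_grades)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy data only; these are NOT grades from the study.
model = [12, 9, 15, 4, 7, 10, 3, 11]
official = [11, 9, 12, 5, 10, 10, 3, 13]

print(f"within ±2 points: {share_within(model, official):.2%}")
print(f"exact matches:    {share_within(model, official, tolerance=0):.2%}")
print(f"Pearson r:        {pearson(model, official):.3f}")
```

Running the same two functions on the original and the translated responses separately would give the kind of side-by-side comparison the abstract reports for Finnish and English.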

Cite this study

Jauhiainen et al. (2026) studied this question.

synapsesocial.com/papers/6a1ab6688198c9a8aa460697 — DOI: https://doi.org/10.1007/s44366-026-0091-1

Authors

Jussi S. Jauhiainen

University of Turku

Agustín Garagorry Guerra

University of Turku

Journals

Frontiers of Digital Education

Institutions

University of Turku
