What question did this study set out to answer?

The aim is to evaluate the reliability and alignment of various large language models in assessing English essays written by non-native speakers.

March 10, 2026 | Open Access

A Framework for Evaluation of Large Language Models in Essay Assessment: Reliability, Alignment, and Causal Reasoning

Key Points

Comparative analysis of five large language models (LLMs) in scoring essays.
Assessment of intra-model reliability through repeated scoring runs.
Evaluation of model outputs against expert human ratings.
Causal feature analysis to explore linguistic influences on scoring behavior.
Some models showed near-perfect reproducibility and alignment with human raters.
Variability among models included inconsistencies, score compression, and systematic underestimation.
Causal analysis identified different evaluative heuristics used by models, focusing on aspects like lexical precision and fluency.

Abstract

Recent advances in large language models have revitalized research on automated essay evaluation, yet critical concerns remain regarding their reliability, validity, and interpretability. This study presents a comparative analysis of five LLMs (GPT-4.1, LLaMA 4 Maverick, Gemini 2.5 Flash, Claude Sonnet 4, and DeepSeek R1) in the assessment of long English essays authored by non-native speakers in higher education. The analysis draws on LLM-generated scores for 60 essays to examine (a) intra-model reliability across repeated scoring runs, (b) the degree of alignment between model outputs and expert human ratings, and (c) causal feature dependencies that clarify how linguistic characteristics influence model scoring behavior. Findings reveal substantial variation: some models achieved near-perfect reproducibility and strong alignment with human raters, whereas others displayed inconsistency, score compression, or systematic underestimation. Causal discovery analysis further uncovered distinct evaluative heuristics, with most models prioritizing lexical precision and fluency, while others emphasized syntactic complexity or cross-domain integration. Collectively, these results establish model-specific reliability profiles and application contexts, providing empirical benchmarks and practical guidance for the responsible use of LLMs in educational writing assessment.
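The abstract does not say which statistics were used, so the following is only a minimal sketch of how the reliability and alignment questions could be quantified for one model. The function name `reliability_profile` and all four metrics are illustrative assumptions, not the authors' method: per-essay standard deviation across runs stands in for reproducibility, Pearson correlation for alignment with human raters, mean signed difference for systematic under- or overestimation, and the ratio of score spreads for score compression.

```python
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def reliability_profile(runs, human):
    """Summarize one model's scoring behavior (hypothetical helper).

    runs:  list of repeated scoring runs, each a list of per-essay scores
    human: list of expert scores, one per essay, in the same essay order
    """
    per_essay = list(zip(*runs))               # each essay's scores across runs
    model_mean = [mean(s) for s in per_essay]  # average model score per essay
    return {
        # 0.0 means every run gave every essay the same score
        "mean_run_sd": mean(pstdev(s) for s in per_essay),
        # linear agreement with the expert ratings
        "pearson_r": pearson(model_mean, human),
        # negative values indicate underestimation relative to experts
        "mean_bias": mean(m - h for m, h in zip(model_mean, human)),
        # values below 1 indicate a narrower score range than the experts'
        "spread_ratio": pstdev(model_mean) / pstdev(human),
    }


if __name__ == "__main__":
    # Toy data, invented purely for illustration
    human_scores = [2, 3, 4, 5]
    model_runs = [[1.5, 2.0, 2.5, 3.0], [1.5, 2.0, 2.5, 3.0]]
    print(reliability_profile(model_runs, human_scores))
```

A model can correlate well with human ratings and still underestimate or compress scores, which is why the sketch reports bias and spread separately from correlation.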
