Recent advances in large language models (LLMs) have revitalized research on automated essay evaluation, yet critical concerns remain regarding their reliability, validity, and interpretability. This study presents a comparative analysis of five LLMs (GPT-4.1, LLaMA 4 Maverick, Gemini 2.5 Flash, Claude Sonnet 4, and DeepSeek R1) in the assessment of long English essays authored by non-native speakers in higher education. The analysis draws on LLM-generated scores for 60 essays to examine (a) intra-model reliability across repeated scoring runs, (b) the degree of alignment between model outputs and expert human ratings, and (c) causal feature dependencies that clarify how linguistic characteristics influence model scoring behavior. Findings reveal substantial variation: some models achieved near-perfect reproducibility and strong alignment with human raters, whereas others displayed inconsistency, score compression, or systematic underestimation. Causal discovery analysis further uncovered distinct evaluative heuristics, with most models prioritizing lexical precision and fluency, while others emphasized syntactic complexity or cross-domain integration. Collectively, these results establish model-specific reliability profiles and application contexts, providing empirical benchmarks and practical guidance for the responsible use of LLMs in educational writing assessment.
Liu et al. (Sun,) studied this question.