ABSTRACT Evaluating answers to open‐ended questions is a common and time‐consuming task in education. With continuing advances in natural language processing (NLP), large language models (LLMs) trained on massive datasets can assist in this process. This study evaluates the use of LLMs, complemented by retrieval‐augmented generation (RAG), for the numerical grading of open‐ended answers of approximately 250 words. We focus on two Spanish‐language technical courses and assess general‐purpose LLMs. Our results show that RAG improves grading accuracy, reducing mean absolute error (MAE) by up to 19.47% compared to using LLMs alone, with the best configuration reaching an MAE of 1.19. We also note that LLMs tend to assign high grades, reflecting the dataset's imbalance toward higher scores. This work demonstrates the potential of combining RAG with general‐purpose LLMs to evaluate specialised Spanish‐language content, avoiding the cost and bias of model fine‐tuning.
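To make the reported metric concrete, the following minimal sketch shows how an MAE and a percentage reduction in MAE are typically computed. It is not the authors' evaluation code. The function names, the toy grades, and the grading scale are hypothetical and chosen only for illustration, and the abstract does not state the baseline MAE behind the 19.47% figure.

```python
def mean_absolute_error(true_grades, predicted_grades):
    """Average absolute difference between human and model-assigned grades."""
    if not true_grades or len(true_grades) != len(predicted_grades):
        raise ValueError("grade lists must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_grades, predicted_grades))
    return total / len(true_grades)


def relative_reduction_pct(baseline_mae, improved_mae):
    """Percentage by which improved_mae is lower than baseline_mae."""
    return 100.0 * (baseline_mae - improved_mae) / baseline_mae


# Hypothetical toy data (NOT from the paper): human grades and two sets
# of model grades, one without and one with retrieved context.
human_grades = [8.0, 6.0, 9.0, 7.0]
llm_only_grades = [9.0, 8.0, 9.0, 9.0]
llm_rag_grades = [8.0, 7.0, 9.0, 9.0]

mae_llm_only = mean_absolute_error(human_grades, llm_only_grades)
mae_llm_rag = mean_absolute_error(human_grades, llm_rag_grades)
reduction = relative_reduction_pct(mae_llm_only, mae_llm_rag)

print(f"MAE (LLM only): {mae_llm_only:.2f}")
print(f"MAE (LLM + RAG): {mae_llm_rag:.2f}")
print(f"Relative reduction: {reduction:.2f}%")
```

In this toy case both sets of model grades err only upward, which mirrors the tendency toward high grades noted in the abstract; the MAE alone does not reveal the direction of the error.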
Fernández‐García et al. (Fri,) studied this question.