July 1, 2024 | Open Access

Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments

Abstract

Recent advancements in natural language processing, computational linguistics, and Artificial Intelligence (AI) have propelled the use of Large Language Models (LLMs) in Automated Essay Scoring (AES), offering efficient and unbiased writing assessment. This study assesses the reliability of LLMs in AES tasks, focusing on scoring consistency and alignment with human raters. We explore the impact of prompt engineering, temperature settings, and multi-level rating dimensions on the scoring performance of LLMs. Results indicate that prompt engineering significantly affects the reliability of LLMs, with GPT-4 showing marked improvement over GPT-3.5 and Claude 2, achieving increases of 112% and 114% in scoring accuracy under the criteria and sample-referenced justification prompt. Temperature settings also influence the output consistency of LLMs, with lower temperatures producing scores more in line with human evaluations, which is essential for maintaining fairness in large-scale assessment. Regarding multi-dimensional writing assessment, results indicate that GPT-4 performs well in dimensions regarding

Tang et al. (2024) studied this question.

synapsesocial.com/papers/68e61ca0b6db6435875aee1b https://doi.org/10.1016/j.heliyon.2024.e34262
