What does this research mean for the field?

A tri-role multi-agent LLM configuration produces the highest-quality automated formative feedback on student writing, while a chain-of-thought single-agent approach delivers nearly comparable quality at a fraction of the inference cost. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

The study investigates how different prompting designs, in particular single-agent versus multi-agent setups, influence the quality of automated formative feedback that large language models produce in educational settings.

May 20, 2026 | Open Access

A Comparative Empirical Evaluation of Single-Agent and Multi-Agent LLM Prompting Strategies for Automated Formative Feedback in Education

Key Points

The study asks how prompting design shapes the quality of LLM-generated formative feedback in education.
The authors ran a controlled comparison of four prompting configurations on 200 undergraduate argumentative essays.
Three trained raters independently scored feedback quality on a multi-dimensional rubric covering accuracy, specificity, constructiveness, and tone.
Automated textual similarity metrics, computed against expert-authored reference feedback, complemented the human ratings.
The tri-role multi-agent configuration achieved the highest composite quality scores.
The tri-role configuration also recorded the lowest rates of over-praise and hallucinated claims about essay content.
The chain-of-thought single-agent variant delivered nearly comparable quality at a fraction of the inference cost.
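
To make the idea of a composite rubric score concrete, here is a minimal sketch of one way three raters' scores could be combined. The summary does not say how the composite was actually computed, so the unweighted mean, the 1-to-5 scale, and the names `DIMENSIONS` and `composite_score` are all assumptions made for illustration.

```python
from statistics import mean

# The four rubric dimensions named in the abstract.
DIMENSIONS = ("accuracy", "specificity", "constructiveness", "tone")


def composite_score(ratings):
    """Combine several raters' rubric scores for one piece of feedback.

    `ratings` is a list with one dict per rater, mapping each dimension
    to that rater's score. ASSUMPTION: scores are averaged across raters
    per dimension, then averaged across dimensions with equal weights.
    The study may have used a different aggregation.
    """
    per_dimension = {
        dim: mean(rater[dim] for rater in ratings) for dim in DIMENSIONS
    }
    overall = mean(list(per_dimension.values()))
    return overall, per_dimension


# Hypothetical scores from three raters on an assumed 1-to-5 scale.
example_ratings = [
    {"accuracy": 4, "specificity": 3, "constructiveness": 5, "tone": 4},
    {"accuracy": 4, "specificity": 4, "constructiveness": 4, "tone": 4},
    {"accuracy": 5, "specificity": 2, "constructiveness": 3, "tone": 4},
]
overall, per_dimension = composite_score(example_ratings)
```

Keeping the per-dimension means alongside the overall value makes it possible to see which rubric dimension drives a difference between configurations, rather than only the composite.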

Abstract

Automated formative feedback has emerged as a focal point in educational technology research, as large language models (LLMs) offer the prospect of providing personalized commentary on student writing at a scale that human instructors alone cannot match. What is less well examined, however, is how the underlying prompting design—particularly the choice between single-agent and multi-agent setups—shapes the pedagogical value of the feedback produced. To examine this question, we conducted a controlled comparison across four prompting configurations on a corpus of 200 undergraduate argumentative essays: a zero-shot single-agent baseline, a chain-of-thought single-agent variant, a dual-role multi-agent pipeline in which one model drafts feedback and another critiques it, and a tri-role multi-agent pipeline that introduces a dedicated revision stage on top of the draft-and-critique loop. Each set of feedback outputs was assessed along a multi-dimensional rubric covering accuracy, specificity, constructiveness, and tone, with three trained raters scoring independently. We also computed automated textual similarity metrics against expert-authored reference feedback to complement the human ratings and provide a more independent check. The tri-role multi-agent configuration produced the highest composite quality scores and, notably, the lowest rates of over-praise and hallucinated claims about essay content. The chain-of-thought single-agent variant, while not topping the rankings, delivered surprisingly close quality at a fraction of the inference cost, making it an attractive option when computational budget or latency matters. We close by discussing what these patterns mean in practice for educators and developers looking to integrate LLM-based feedback agents into higher-education writing workflows at scale.
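
The tri-role pipeline described in the abstract (draft, critique, revise) can be sketched roughly as follows. This is an illustrative outline only, not the authors' implementation: the prompt wording, the `LLM` callable type, and the `CountingStub` stand-in model are all hypothetical, and a real system would replace the stub with calls to an actual model.

```python
from typing import Callable

# Hypothetical interface: any function that takes a prompt and returns a reply.
LLM = Callable[[str], str]


def tri_role_feedback(llm: LLM, essay: str) -> str:
    """Draft feedback, critique it, then revise it (three model calls)."""
    # Role 1: draft initial feedback on the essay.
    draft = llm(
        "ROLE: drafter\n"
        "Write formative feedback on this essay.\n\n"
        f"ESSAY:\n{essay}"
    )
    # Role 2: critique the draft, targeting the failure modes the abstract
    # mentions (over-praise and claims not supported by the essay).
    critique = llm(
        "ROLE: critic\n"
        "Check the feedback for over-praise and for claims about the essay "
        "that the essay text does not support.\n\n"
        f"ESSAY:\n{essay}\n\nFEEDBACK:\n{draft}"
    )
    # Role 3: revise the draft in light of the critique.
    revised = llm(
        "ROLE: reviser\n"
        "Rewrite the feedback so that it addresses the critique.\n\n"
        f"ESSAY:\n{essay}\n\nFEEDBACK:\n{draft}\n\nCRITIQUE:\n{critique}"
    )
    return revised


class CountingStub:
    """Stand-in model that counts calls instead of contacting a real LLM."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, prompt: str) -> str:
        self.calls += 1
        role = prompt.split("\n", 1)[0]  # first line, e.g. "ROLE: critic"
        return f"[reply {self.calls} to {role}]"


stub = CountingStub()
feedback = tri_role_feedback(stub, "Essay text goes here.")
```

In this sketch every essay costs three model calls, and the later prompts grow because they carry the draft and the critique. A single-agent variant makes one call per essay, which is one plausible reading of the inference-cost gap the abstract reports.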
