Automated formative feedback has become a focal point of educational technology research, since large language models (LLMs) promise personalized commentary on student writing at a scale that human instructors alone cannot match. Less well examined is how the prompting design, particularly the choice between single-agent and multi-agent setups, shapes the pedagogical value of the feedback produced. We conducted a controlled comparison of four prompting configurations on a corpus of 200 undergraduate argumentative essays: a zero-shot single-agent baseline, a chain-of-thought single-agent variant, a dual-role multi-agent pipeline in which one model drafts feedback and another critiques it, and a tri-role multi-agent pipeline that adds a dedicated revision stage to the draft-and-critique loop. Three trained raters independently scored each set of feedback outputs on a multi-dimensional rubric covering accuracy, specificity, constructiveness, and tone. To complement the human ratings with a rater-independent check, we also computed automated textual similarity metrics against expert-authored reference feedback. The tri-role multi-agent configuration produced the highest composite quality scores and, notably, the lowest rates of over-praise and hallucinated claims about essay content. The chain-of-thought single-agent variant did not top the rankings but came close in quality at a fraction of the inference cost, making it an attractive option when computational budget or latency is a constraint. We close by discussing the practical implications of these patterns for educators and developers integrating LLM-based feedback agents into higher-education writing workflows at scale.
Lai et al. (Mon,) studied this question.
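
The tri-role configuration described above (draft, critique, revise) can be illustrated with a minimal sketch. This is not the study's implementation: the function name `tri_role_feedback`, the role prompts, and the `echo_model` stand-in are all hypothetical, and the model is assumed to be any callable that maps a prompt string to a completion string.

```python
from typing import Callable

# Assumption: a "model" is any function from prompt text to completion text.
Model = Callable[[str], str]


def tri_role_feedback(essay: str, model: Model) -> dict[str, str]:
    """Run three sequential stages, each with its own (hypothetical) role prompt."""
    # Stage 1: a drafter writes initial feedback from the essay alone.
    draft = model(
        "ROLE: drafter\n"
        "Write formative feedback on this essay.\n\n"
        f"ESSAY:\n{essay}"
    )
    # Stage 2: a critic reviews the draft against the essay.
    critique = model(
        "ROLE: critic\n"
        "Check the feedback for inaccurate claims about the essay, "
        "vague advice, and over-praise.\n\n"
        f"ESSAY:\n{essay}\n\nFEEDBACK:\n{draft}"
    )
    # Stage 3: a reviser rewrites the draft in light of the critique.
    revised = model(
        "ROLE: reviser\n"
        "Rewrite the feedback so it addresses every point in the critique.\n\n"
        f"ESSAY:\n{essay}\n\nFEEDBACK:\n{draft}\n\nCRITIQUE:\n{critique}"
    )
    return {"draft": draft, "critique": critique, "revised": revised}


def echo_model(prompt: str) -> str:
    """Stand-in for a real LLM call: reports which role it was asked to play."""
    role = prompt.split("\n", 1)[0].removeprefix("ROLE: ")
    return f"[{role} output]"


if __name__ == "__main__":
    stages = tri_role_feedback("Essay text goes here.", echo_model)
    for name, text in stages.items():
        print(name, "->", text)
```

Under the same assumptions, the dual-role configuration corresponds to stopping after the critique stage. The sketch also makes the cost trade-off visible: the tri-role pipeline issues three model calls per essay, where a single-agent variant issues one.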