What question did this study set out to answer?

This study aims to evaluate the effectiveness and safety of LLM-generated feedback for primary writing by comparing it to tutor feedback.

June 20, 2026 · Open Access

Boundary Conditions for LLM-Generated Feedback in Primary Writing: An Educator-Aligned Evaluation and Design Considerations

Key Points

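The authors conducted an educator-centered evaluation of GPT-4 Turbo feedback on Year 5 narrative and persuasive writing within an established online tutoring program.
Authentic student drafts with existing tutor feedback were used to generate parallel LLM feedback via rubric-aligned prompting, and the two sources were compared in a blinded, within-script design.
Four experienced English specialists co-designed a six-dimensional rubric (clarity, specificity, helpfulness, feasibility, relevance, and overall effectiveness) and rated both feedback sources; their written reflections were analyzed thematically.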
Tutor feedback received slightly higher mean ratings across dimensions, most clearly for helpfulness, but no difference remained statistically significant after Holm-Bonferroni correction.
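LLM feedback was often rated similarly to tutor feedback for clarity and feasibility but was frequently characterized as generic, surface-focused, and occasionally misaligned with the student draft.
LLM feedback appears most appropriate as rapid first-pass support for routine structure and surface revision and least appropriate for developmental judgment and context-sensitive guidance; used without mediation, it risks misleading learners.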

Abstract

Generative large language models (LLMs) are increasingly used to support writing feedback. However, the pedagogical safety and usefulness of LLM feedback for primary students remain under-evaluated. This study reports an educator-centered evaluation of GPT-4 Turbo for Year 5 narrative and persuasive writing in the context of an established online tutoring program. Using authentic students’ drafts paired with tutor feedback, we generated parallel LLM feedback via rubric-aligned prompting and compared the two feedback sources in a blinded, within-script design. Four experienced English specialists co-designed a six-dimensional rubric (clarity, specificity, helpfulness, feasibility, relevance, and overall effectiveness) and rated tutor versus LLM feedback for each script; their written reflections were analyzed thematically to surface boundary conditions and risk perceptions. Across dimensions, tutor feedback received slightly higher mean ratings, with the clearest descriptive advantage in perceived helpfulness; however, none of the differences remained statistically significant after Holm-Bonferroni correction. LLM feedback was often rated similarly for clarity and feasibility but was frequently characterized as generic, surface-focused, and occasionally misaligned with the student draft, which increased verification effort and posed a risk of misleading learners if used without mediation. Synthesizing ratings and educator reflections, we identify the conditions under which LLM feedback is appropriate: it is best suited to rapid first-pass support for routine structure and surface revision, and least suited to developmental judgment and context-sensitive guidance. We translate these findings into design requirements for teacher-in-the-loop primary writing feedback systems, including alignment to explicit pedagogical constructs, editable workflows, and safeguards that reduce unsupported feedback before release to students.
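For readers unfamiliar with the Holm-Bonferroni correction mentioned in the abstract, the following is a minimal Python sketch of the step-down procedure. It is not the authors' analysis code. The function name `holm_bonferroni`, the significance level of 0.05, and the p-values are illustrative assumptions; the p-values are placeholders and not results from the study. Only the six dimension names come from the abstract.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans in the same order as p_values:
    True where the null hypothesis is rejected after correction.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared
        # against alpha / (m - k + 1).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            # Step-down rule: once one test fails, all larger p-values fail too.
            break
    return reject


# The six rubric dimensions named in the abstract.
dimensions = [
    "clarity",
    "specificity",
    "helpfulness",
    "feasibility",
    "relevance",
    "overall effectiveness",
]

# Placeholder p-values for illustration only (NOT the study's results).
illustrative_p = [0.40, 0.09, 0.02, 0.55, 0.20, 0.07]

decisions = holm_bonferroni(illustrative_p)
for name, p, rejected in zip(dimensions, illustrative_p, decisions):
    label = "significant" if rejected else "not significant"
    print(f"{name}: p = {p:.2f} -> {label} after correction")
```

With these placeholder values, the smallest p-value (0.02) would pass an uncorrected 0.05 threshold but not the corrected threshold of 0.05 / 6, so no comparison is flagged. This shows how a descriptive advantage on one dimension can fail to survive correction when six dimensions are tested together.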
