Generative large language models (LLMs) are increasingly used to support writing feedback. However, the pedagogical safety and usefulness of LLM feedback for primary students remain under-evaluated. This study reports an educator-centered evaluation of GPT-4 Turbo for Year 5 narrative and persuasive writing in the context of an established online tutoring program. Using authentic student drafts paired with tutor feedback, we generated parallel LLM feedback via rubric-aligned prompting and compared the two feedback sources in a blinded, within-script design. Four experienced English specialists co-designed a six-dimensional rubric (clarity, specificity, helpfulness, feasibility, relevance, and overall effectiveness) and rated tutor and LLM feedback for each script; their written reflections were analyzed thematically to surface boundary conditions and risk perceptions. Across dimensions, tutor feedback received slightly higher mean ratings, with the clearest descriptive advantage in perceived helpfulness; however, none of the differences remained statistically significant after Holm-Bonferroni correction. LLM feedback was often rated similarly to tutor feedback for clarity and feasibility but was frequently characterized as generic, surface-focused, and occasionally misaligned with the student draft, which increased verification effort and posed a risk of misleading learners if used without mediation. Synthesizing ratings and educator reflections, we identify the conditions under which LLM feedback is most appropriate (rapid first-pass support for routine structure and surface revision) and least appropriate (developmental judgment and context-sensitive guidance). We translate these findings into design requirements for teacher-in-the-loop primary writing feedback systems, including alignment to explicit pedagogical constructs, editable workflows, and safeguards that reduce unsupported feedback before release to students.
Zhang et al. studied this question.