What question did this study set out to answer?

The aim is to assess the reliability of large language models as automated graders under answer-side manipulation.

March 21, 2026Open Access

EvalHack: Answer-Side Prompt Injection for Probing LLM Exam-Grading Panel Stability

Key Points

The aim is to assess the reliability of large language models as automated graders under answer-side manipulation.
Developed EvalHack, a benchmark with a committee of four LLMs for grading exam answers.
Created a dataset of 1000 student answers to evaluate grading stability.
Implemented answer-side modifications: a visible coercive suffix and a stealth variant using Unicode characters.
Recorded item-level scores, aggregate scores, and discrepancies compared to human grades.
Answer-side edits resulted in systematic score inflation and clustering of scores at the top.
Disagreement across the grading panel varied, with median scores indicating consistency spreads of 3.0, 2.0, and 6.0 for clean, A1, and A2 conditions, respectively.
The panel graded more leniently than human evaluators, with a mean absolute error of 1.897.

Abstract

Large language models are increasingly used as automated graders, yet their reliability under answer-side manipulation and their behavior in multi-model panels remain insufficiently understood. This paper introduces EvalHack, a matrix benchmark in which a fixed committee of four LLMs grades university-level machine learning exam answers under a strict integer-only contract (0–10) grounded in instructor-authored rubric artifacts. The dataset comprises 100 students answering 10 short, open-ended items (1000 answers). For each answer, the evaluation includes a clean version and two content-preserving adversarial variants that operate only on the student text: A1, a visible coercive suffix appended to the answer, and A2, a stealth variant that uses Unicode control characters (e.g., zero-width and bidirectional marks) to embed an instruction. EvalHack instruments the full grading pipeline, recording item-level member scores, the committee aggregate, within-panel disagreement, and discrepancies to human grades. Empirically, answer-side edits induce systematic score inflation and stronger top-end concentration, with edited answers clustering near the upper end of the scale. Within-panel disagreement, measured as the range between the highest and lowest member score, varies across conditions, with median Consistency Spread values of 3.0 (clean), 2.0 (A1), and 6.0 (A2). Compared to human graders, the panel is more lenient on average (MAE = 1.897; bias human − panel = −1.345). Finally, grouping items by disagreement shows that low-disagreement items exhibit smaller human-panel errors, indicating that within-panel spread can serve as a practical uncertainty signal for routing difficult answers to human review or to larger/more specialized panels.

Mark Helpful

Bookmark

Relay

View Full Paper