Large language models are increasingly used as automated graders, yet their reliability under answer-side manipulation and their behavior in multi-model panels remain insufficiently understood. This paper introduces EvalHack, a matrix benchmark in which a fixed committee of four LLMs grades university-level machine learning exam answers under a strict integer-only contract (0–10) grounded in instructor-authored rubric artifacts. The dataset comprises 100 students answering 10 short, open-ended items (1000 answers). For each answer, the evaluation includes a clean version and two content-preserving adversarial variants that operate only on the student text: A1, a visible coercive suffix appended to the answer, and A2, a stealth variant that uses Unicode control characters (e.g., zero-width and bidirectional marks) to embed an instruction. EvalHack instruments the full grading pipeline, recording item-level member scores, the committee aggregate, within-panel disagreement, and discrepancies to human grades. Empirically, answer-side edits induce systematic score inflation and stronger top-end concentration, with edited answers clustering near the upper end of the scale. Within-panel disagreement, measured as the range between the highest and lowest member score, varies across conditions, with median Consistency Spread values of 3.0 (clean), 2.0 (A1), and 6.0 (A2). Compared to human graders, the panel is more lenient on average (MAE = 1.897; bias human − panel = −1.345). Finally, grouping items by disagreement shows that low-disagreement items exhibit smaller human-panel errors, indicating that within-panel spread can serve as a practical uncertainty signal for routing difficult answers to human review or to larger/more specialized panels.
Anghel et al. (Wed,) studied this question.