Large Language Models (LLMs) are increasingly used for rubric-based assessment, yet reliability is limited by instability, bias, and weak diagnostics. We present EvalCouncil, a committee-and-chief framework for rubric-guided grading with auditable traces and a human adjudication baseline. Our objectives are to (i) characterize domain structure in Human–LLM alignment, (ii) assess robustness to concordance tolerance and panel composition, and (iii) derive a domain-adaptive audit policy grounded in dispersion and chief–panel differences. Authentic student responses from two domains–Computer Networks (CNs) and Machine Learning (ML)–are graded by multiple heterogeneous LLM evaluators using identical rubric prompts. A designated chief arbitrator operates within a tolerance band and issues the final grade. We quantify within-panel dispersion via MPAD (mean pairwise absolute deviation), measure chief–panel concordance (e.g., absolute error and bias), and compute Human–LLM deviation. Robustness is examined by sweeping the tolerance and performing leave-one-out perturbations of panel composition. All outputs and reasoning traces are stored in a graph database for full provenance. Human–LLM alignment exhibits systematic domain dependence: ML shows tighter central tendency and shorter upper tails, whereas CN displays broader dispersion with heavier upper tails and larger extreme spreads. Disagreement increases with item difficulty as captured by MPAD, concentrating misalignment on a relatively small subset of items. These patterns are stable to tolerance variation and single-grader removals. The signals support a practical triage policy: accept low-dispersion, small-gap items; apply a brief check to borderline cases; and adjudicate high-dispersion or large-gap items with targeted rubric clarification. EvalCouncil instantiates a committee-and-chief, rubric-guided grading workflow with committee arbitration, a human adjudication baseline, and graph-based auditability in a real classroom deployment. By linking domain-aware dispersion (MPAD), a policy tolerance dial, and chief–panel discrepancy, the study shows how these elements can be combined into a replicable, auditable, and capacity-aware approach for organizing LLM-assisted grading and identifying instability and systematic misalignment, while maintaining pedagogical interpretability.
Anghel et al. (Wed,) studied this question.