December 3, 2025 · Open Access

EvalCouncil: A Committee-Based LLM Framework for Reliable and Unbiased Automated Grading

Key Points

EvalCouncil targets the instability, bias, and weak diagnostics that limit the reliability of LLM-based rubric grading.
The framework uses a committee of heterogeneous LLM graders plus a chief arbitrator, who operates within a tolerance band and issues the final grade.
All outputs and reasoning traces are stored in a graph database for provenance, and within-panel dispersion is measured with MPAD (mean pairwise absolute deviation).
The resulting triage policy accepts low-dispersion, small-gap items, applies a brief check to borderline cases, and sends high-dispersion or large-gap items to adjudication with targeted rubric clarification.

Abstract

Large Language Models (LLMs) are increasingly used for rubric-based assessment, yet reliability is limited by instability, bias, and weak diagnostics. We present EvalCouncil, a committee-and-chief framework for rubric-guided grading with auditable traces and a human adjudication baseline. Our objectives are to (i) characterize domain structure in Human–LLM alignment, (ii) assess robustness to concordance tolerance and panel composition, and (iii) derive a domain-adaptive audit policy grounded in dispersion and chief–panel differences. Authentic student responses from two domains, Computer Networks (CN) and Machine Learning (ML), are graded by multiple heterogeneous LLM evaluators using identical rubric prompts. A designated chief arbitrator operates within a tolerance band and issues the final grade. We quantify within-panel dispersion via MPAD (mean pairwise absolute deviation), measure chief–panel concordance (e.g., absolute error and bias), and compute Human–LLM deviation. Robustness is examined by sweeping the tolerance and performing leave-one-out perturbations of panel composition. All outputs and reasoning traces are stored in a graph database for full provenance. Human–LLM alignment exhibits systematic domain dependence: ML shows tighter central tendency and shorter upper tails, whereas CN displays broader dispersion with heavier upper tails and larger extreme spreads. Disagreement increases with item difficulty as captured by MPAD, concentrating misalignment on a relatively small subset of items. These patterns are stable to tolerance variation and single-grader removals. The signals support a practical triage policy: accept low-dispersion, small-gap items; apply a brief check to borderline cases; and adjudicate high-dispersion or large-gap items with targeted rubric clarification.
EvalCouncil instantiates a committee-and-chief, rubric-guided grading workflow with committee arbitration, a human adjudication baseline, and graph-based auditability in a real classroom deployment. By linking domain-aware dispersion (MPAD), a policy tolerance dial, and chief–panel discrepancy, the study shows how these elements can be combined into a replicable, auditable, and capacity-aware approach for organizing LLM-assisted grading and identifying instability and systematic misalignment, while maintaining pedagogical interpretability.
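
To make the signals in the abstract concrete, the following is a minimal sketch of how MPAD, the chief–panel gap, a leave-one-out perturbation, and the three-way triage could be computed for a single item. It is an illustration under stated assumptions, not the authors' implementation: the function names, the use of the panel mean as the reference for the chief–panel gap, the single shared pair of thresholds (`low`, `high`), and the sample scores are all hypothetical and do not come from the paper.

```python
from itertools import combinations
from statistics import mean


def mpad(scores):
    """Mean pairwise absolute deviation over all unordered pairs of panel scores."""
    pairs = list(combinations(scores, 2))
    if not pairs:  # a single grader has no pairwise disagreement
        return 0.0
    return mean(abs(a - b) for a, b in pairs)


def chief_panel_gap(chief, scores):
    """Signed gap between the chief's grade and the panel mean (assumed reference)."""
    return chief - mean(scores)


def leave_one_out_mpad(scores):
    """MPAD recomputed with each grader removed in turn."""
    return [mpad(scores[:i] + scores[i + 1:]) for i in range(len(scores))]


def triage(scores, chief, low=0.5, high=1.5):
    """Route one item using dispersion and the chief-panel gap.

    `low` and `high` are hypothetical thresholds on the grading scale.
    An item is accepted only if both signals are small, and adjudicated
    if either signal is large; everything in between gets a brief check.
    """
    signal = max(mpad(scores), abs(chief_panel_gap(chief, scores)))
    if signal <= low:
        return "accept"
    if signal <= high:
        return "brief_check"
    return "adjudicate"


if __name__ == "__main__":
    # Illustrative scores only; not data from the study.
    print(triage([7.0, 7.5, 7.0], chief=7.0))
    print(triage([6.0, 7.0, 8.0], chief=7.0))
    print(triage([4.0, 8.0, 6.0], chief=6.0))
    print(leave_one_out_mpad([4.0, 8.0, 6.0]))
```

Taking the maximum of the two signals is one simple way to encode "low-dispersion and small-gap" versus "high-dispersion or large-gap". A real deployment would more likely tune separate thresholds for each signal and for each domain, in line with the domain dependence the abstract reports.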

Cite This Study

Anghel et al. (2025) studied this question.

synapsesocial.com/papers/694025972d562116f28fec67 https://doi.org/10.3390/computers14120530
