What question did this study set out to answer?

The aim is to validate multi-agent evaluation norms and assess operational redundancy between Parsimony and Internal Consistency.

June 15, 2026 · Open Access

Agreement Is Not Validation: A Preregistration-and-Adversarial-Audit Loop for Validating Multi-Agent Evaluation Norms

Key Points

Developed a validation loop incorporating preregistration, blind elicitation, dual extraction, per-agent analysis, and adversarial audit.
Applied the loop across four escalating modules to test whether Parsimony and Internal Consistency are redundant.
Tested the norms on a fully decorrelated, perfectly balanced 18-model stimulus with zero design covariance.
Redundancy between the two norms was shown to be primarily a stimulus construction artifact, with a Kendall-tau reduction from 0.69 to 0.04 under decorrelation.
Four nominally independent agents contributed byte-identical rank vectors, indicating one replicated strategy rather than four independent observations.
Within the tested design, no evidence of operational redundancy was found between the two norms.
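
The Kendall-tau figures and the identical-rank-vector finding above can be illustrated with a minimal sketch. It is not the study's own script: the tau-b variant, the function names `kendall_tau_b` and `group_identical_vectors`, and the toy rank vectors are all assumptions made for illustration, since the summary does not say which tau variant or data layout was used.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall tau-b between two equal-length rank vectors (ties allowed).

    Assumption: tau-b is used here for illustration only; the source
    does not state which Kendall-tau variant the study computed.
    """
    if len(x) != len(y):
        raise ValueError("rank vectors must have equal length")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for i, j in combinations(range(len(x)), 2):
        dx = x[i] - x[j]
        dy = y[i] - y[j]
        if dx == 0 and dy == 0:
            continue  # tied on both norms: excluded from numerator and denominator
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


def group_identical_vectors(agent_ranks):
    """Group agents whose rank vectors are exactly identical.

    Each group counts as one strategy, however many agents share it.
    """
    groups = {}
    for agent, ranks in agent_ranks.items():
        groups.setdefault(tuple(ranks), []).append(agent)
    return list(groups.values())


if __name__ == "__main__":
    # Hypothetical per-agent rankings of the same five models under each norm.
    parsimony = {"agent_a": [1, 2, 3, 4, 5], "agent_b": [1, 2, 3, 4, 5]}
    consistency = {"agent_a": [2, 1, 3, 5, 4], "agent_b": [2, 1, 3, 5, 4]}
    # Per-agent analysis: one tau per agent, never pooled across agents.
    for agent in parsimony:
        print(agent, kendall_tau_b(parsimony[agent], consistency[agent]))
    print(group_identical_vectors(consistency))
```

Grouping before counting makes the point in the key findings concrete: two agents with the same vector add one observation, not two, so the number of groups is a better proxy for independent evidence than the number of agents.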

Abstract

Multi-agent and LLM-as-judge evaluation increasingly treats agreement among raters as evidence that the underlying measurement is sound. This inference is unsafe: agents drawn from the same model class share training corpora and heuristic priors, so their agreement can reflect shared bias as readily as shared access to ground truth. Agreement is a property of the raters; validity is a property of the instrument.

This paper presents an operational loop for validating whether the norms of a multi-agent evaluation rubric are operationally distinct, a discriminant-validity question that agreement alone cannot answer. The loop combines five guards, each motivated by a documented failure mode: preregistration of hypotheses, thresholds, and exclusion rules (frozen and SHA-256 hash-anchored before any data); blind clean-chat elicitation; dual independent extraction with permutation checks; per-agent analysis without pooling; and independent adversarial audit by out-of-band re-implementation rather than script execution.

The loop is applied to a concrete suspicion raised in earlier work: that two evaluation norms, Parsimony and Internal Consistency, are redundant. Across four escalating modules, the apparent coupling (Kendall-tau = 0.69 over full rank vectors) is shown to be substantially a stimulus-construction artifact. It halves to 0.36 under partial decorrelation and collapses to 0.04, at the worst-case artifact ceiling, under a fully decorrelated, perfectly balanced 18-model stimulus with zero design covariance. A residual coupling survives only for a minority of agents and is traced, via a non-LLM mechanical method axis, to declared tie-break strategy rather than to norm geometry. Notably, four nominally independent “orthogonal” agents are found to contribute byte-identical rank vectors (one replicated strategy, not four corroborating observations), an empirical instance of agreement overstating evidence.

The conclusion, within the tested design, is that no evidence of operational redundancy between the two norms was found; the apparent redundancy was stimulus construction plus a minority tie-break strategy. The primary contribution, however, is methodological rather than the single negative-leaning result: a reusable architecture in which the validation of the instrument is held to the same evidentiary standard as the result it produces. Every audit node in the study erred at least once and was caught by another: the architecture, not any single node, is the argument.

The work is positioned against the international research landscape in evaluation-as-measurement and construct validity (Jacobs Wallach et al. 2025; Hardt 2025), the LLM-as-judge bias literature, the transfer of preregistration into machine learning, and, in deliberate tension, the multi-agent-debate literature, with which it nonetheless converges on the finding that rater independence, not rater count, is what makes aggregation trustworthy.

This is workshop-track measurement methodology: a validated method plus one terminal negative-leaning finding, not a capability benchmark. Predecessor module R90.6 is summarized as the mandate that framed R90.7 as a controlled falsification round. Full preregistrations with hash chain, frozen input packages, adversarial-audit reports, reproducible scripts, and a plain-language summary for a general audience are included as appendices. A German-language version of the manuscript is published in parallel.

Keywords: construct validity, discriminant validity, LLM-as-judge, multitrait-multimethod, preregistration, adversarial audit, evaluation science, rubric-based evaluation
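
The abstract describes preregistrations that are frozen and SHA-256 hash-anchored in a hash chain. One way this could work is sketched below. It is not the paper's actual tooling: the function names `freeze` and `verify`, the JSON canonicalisation, the chaining scheme, and the example fields are assumptions for illustration.

```python
import hashlib
import json


def freeze(document, previous_hash=""):
    """Return a SHA-256 hex digest that anchors `document` to `previous_hash`.

    Assumption: the document is serialised as canonical JSON (sorted keys,
    fixed separators) so that the same content always yields the same hash.
    """
    canonical = json.dumps(
        document, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    payload = (previous_hash + canonical).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def verify(document, previous_hash, anchored_hash):
    """Check that `document` still matches the hash recorded before data collection."""
    return freeze(document, previous_hash) == anchored_hash


if __name__ == "__main__":
    # Hypothetical preregistration content; the field names and the threshold
    # value are invented for this example.
    prereg_module_1 = {
        "hypothesis": "Parsimony and Internal Consistency are redundant",
        "redundancy_threshold_tau": 0.5,
        "exclusion_rules": ["incomplete rank vector"],
    }
    prereg_module_2 = {"hypothesis": "Coupling persists under decorrelation"}

    h1 = freeze(prereg_module_1)
    h2 = freeze(prereg_module_2, previous_hash=h1)  # chained to module 1

    # Editing a threshold after the fact breaks verification.
    tampered = dict(prereg_module_1, redundancy_threshold_tau=0.8)
    print(verify(prereg_module_1, "", h1), verify(tampered, "", h1))
    print(h2)
```

Chaining each module's hash to its predecessor means a later preregistration cannot be verified without the earlier ones staying unchanged, which is the property a hash chain is meant to provide.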
