What question did this study set out to answer?

This study asks when disagreement between agents in a multi-agent LLM scoring system signals learning rather than bias or error, and introduces mechanisms for human-governed memory consolidation.

July 1, 2026 · Open Access

When Disagreement Means Learning (or Bias): Human-Governed Memory Consolidation and Counterfactual Diagnosis in Multi-Agent LLM Scoring

Key Points

Developed Mizaan, a multi-agent scoring system, and evaluated it across software-engineering and real-estate domains.
Implemented human corrections as instance-level retrieved exemplars and measured their effectiveness.
Conducted evaluations over three independent runs without fine-tuning.
Instance-level learning lifted similar held-out items above unrelated controls by a 0.20 separation (+0.33 vs. +0.12), but leaked onto same-domain items it should not affect (+0.35).
Agents agreed on 88.3 ± 0.5% of items across runs.
Counterfactual cold scoring exposed a systematic auditor offset, helping to separate disagreement caused by learning from disagreement caused by error.
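
The separation figure in the key points can be read as a difference of two lifts. The following is a minimal sketch, assuming that "lift" means the mean per-item score change after corrections are added and that "separation" is the lift on similar held-out items minus the lift on unrelated controls. The function names are hypothetical and are not taken from the paper.

```python
from statistics import mean


def lift(before, after):
    """Mean per-item score change between two scoring passes (assumed definition)."""
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty score lists of equal length")
    return mean(a - b for b, a in zip(before, after))


def separation(similar_lift, control_lift):
    """Lift on similar held-out items minus lift on unrelated controls."""
    return similar_lift - control_lift


# Rounded lifts quoted in the key points: +0.33 (similar) and +0.12 (controls).
reported = separation(0.33, 0.12)
```

With the rounded lifts quoted above this difference is about 0.21; the reported 0.20 presumably reflects unrounded values.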

Abstract

Multi-agent validation is a natural defence against the unreliability of a single LLM judge: a second, independent agent scores the same item, and disagreement routes the case to a human. But this design breaks down exactly when the system starts to learn. If the primary scorer adapts to human corrections through retrieval-augmented exemplars while the auditor stays a cold skeptic, every freshly learned lesson re-creates the disagreement it was meant to resolve, and the system escalates precisely the cases it just learned to handle. We present Mizaan, a production-grade multi-agent scoring system, and three mechanisms that address this tension, evaluated across two contrasting domains (software-engineering work items and real-estate listings) over three independent runs with no fine-tuning anywhere in the loop. Our central result is human-governed memory consolidation. Human corrections act first as instance-level retrieved exemplars seen only by the primary; when corrections teaching the same lesson cluster in embedding space, an LLM drafts the lesson as a rubric amendment, a human approves it, both agents then inherit it, and the promoted exemplars are retired. We find that instance-level learning lifts similar held-out items above unrelated controls by a 0.20 separation (+0.33 vs. +0.12), which we take as the learning measure rather than the absolute lift, but leaks onto same-domain items it should not affect (+0.35); on the domain where corrections cluster, consolidation eliminates this leakage, returning controls to baseline while preserving the learned lift, so the policy channel is measurably more precise than the instance channel there. Second, inter-agent disagreement is a reproducible signal: the agents agree on 88.3 ± 0.5% of items across runs.
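
The consolidation loop described above (cluster, draft, approve, retire) can be sketched roughly as follows. This is an illustrative sketch only: the greedy cosine-similarity clustering, the `threshold` and `min_size` parameters, and the `draft` and `approve` callables (standing in for the LLM drafting step and the human approval step) are all assumptions, not Mizaan's actual implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cluster_corrections(embeddings, threshold=0.9):
    """Greedy single-pass clustering: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of indices; index 0 is the seed
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            if cosine(embeddings[cluster[0]], emb) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def consolidate(corrections, embeddings, draft, approve, min_size=3, threshold=0.9):
    """Promote dense clusters of corrections to rubric amendments.

    draft(list_of_corrections) -> proposal   (hypothetical LLM drafting step)
    approve(proposal) -> bool                (hypothetical human gate)
    Returns (approved amendments, corrections that stay as instance exemplars).
    """
    amendments, retired = [], set()
    for cluster in cluster_corrections(embeddings, threshold):
        if len(cluster) < min_size:
            continue  # too few corrections teach this lesson; keep them as exemplars
        proposal = draft([corrections[i] for i in cluster])
        if approve(proposal):
            amendments.append(proposal)
            retired.update(cluster)  # promoted exemplars are retired
    remaining = [c for i, c in enumerate(corrections) if i not in retired]
    return amendments, remaining
```

In this sketch nothing is promoted without the `approve` step returning true, which mirrors the human-governed aspect; rejected or small clusters simply remain instance-level exemplars.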
Third, counterfactual cold scoring (scoring each item with and without retrieved corrections) serves as a routing signal, but its primary value at scale is as a diagnostic: it exposes a systematic auditor offset that warm-gap routing would mask, and we characterize precisely when inter-agent disagreement can be attributed to learning rather than error. A deterministic spot-check canary surfaces deliberately injected bad lessons; we characterize the sample sizes a usable overturn rate would require. Mizaan’s inter-agent agreement replicates across two further base models including a cross-vendor one, and its learning separation stays positive throughout, though its magnitude, like the absolute score shifts, varies with the base model; these supplementary runs cover instance learning only, so consolidation robustness across base models remains future work. Throughout, learning is immediate, inspectable, and requires no gradient update, GPU, or redeployment.
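
One way to read the counterfactual diagnostic is as a decomposition of the warm gap between the two agents. The sketch below assumes three scores per item: the primary with retrieved corrections (warm), the primary without them (cold), and the auditor. Under that assumption the warm gap splits into a learning effect plus a cold gap, and the mean cold gap estimates a systematic auditor offset. The function names and this particular decomposition are illustrative assumptions, not the paper's stated procedure.

```python
from statistics import mean


def decompose(primary_warm, primary_cold, auditor):
    """Split the warm inter-agent gap into a learning effect and a cold gap.

    warm_gap = learning + cold_gap holds by construction.
    """
    learning = primary_warm - primary_cold  # shift caused by retrieved corrections
    cold_gap = primary_cold - auditor       # disagreement that exists without any learning
    return {
        "warm_gap": primary_warm - auditor,
        "learning": learning,
        "cold_gap": cold_gap,
    }


def auditor_offset(items):
    """Mean cold gap over (primary_warm, primary_cold, auditor) triples."""
    return mean(decompose(*item)["cold_gap"] for item in items)
```

Routing on the warm gap alone mixes both terms, which is why a constant cold gap could go unnoticed there but shows up when the cold score is computed separately.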
