June 29, 2026 · Open Access

Sheaf-Theoretic Consistency Measures for Multi-Agent AI: Construction and Empirical Evaluation

Key Points

The aim is to evaluate whether multiple AI agents provide coherent outputs and identify sources of disagreement.
Developed a family of computable sheaf-theoretic measures: a combinatorial H⁰ disagreement measure, a sheaf-Laplacian coherence defect, a relational H¹, and a distributional contextuality test.
Evaluated the measures on graded, relational, and distributional data.
Ran a discrete multiple-choice task on a live panel of language models.
On graded, relational, and distributional data, the measures beat cheap baselines at tracking conflict severity, respecting confidence, and localizing disagreement.
Detected cyclic frustration, which is invisible to any pairwise statistic, with an AUC of 1.00.
On the multiple-choice task, coherence predicts wrong answers but does not beat simple vote-counting.

Abstract

This paper asks a simple question about modern AI systems: when you put several models or agents on a task, do their outputs actually cohere into one trustworthy answer — and if not, where, and how much? Standard tooling checks only that the components communicated, not that they agreed. We show that this is, formally, the problem sheaf theory was built for — how local pieces of information glue into a consistent global whole — and we turn that mathematics into a family of computable consistency measures for the outputs of N agents: a fast combinatorial H⁰ disagreement measure, a sheaf-Laplacian "coherence defect," a relational H¹ that detects cyclically frustrated disagreement no pairwise check can see, and a distributional contextuality test. We evaluate them honestly. On graded, relational, and distributional data the measures clearly beat cheap baselines — tracking conflict severity, respecting confidence, localizing the disagreement, and catching cyclic frustration (AUC 1.00) that is invisible to any pairwise statistic. On a discrete multiple-choice task run on a live panel of language models, coherence predicts wrong answers but does not beat simple vote-counting. We report this negative as plainly as the positives: the construction earns its keep on graded, relational, and distributional problems — not one-of-N voting. Code and data are released for reproducibility.
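
To make the idea of cyclically frustrated disagreement concrete, here is a minimal toy sketch in Python. It is not the paper's construction: it assumes three agents that each hold one real number, and it uses signed edges (+1 for "should agree", -1 for "should be opposite") as a simplified stand-in for sheaf restriction maps. The names `coboundary`, `coherence_defect`, and `min_defect` are hypothetical, chosen for illustration only. The defect is the squared norm of the coboundary applied to the agents' values, and its minimum over unit vectors is the smallest eigenvalue of the resulting Laplacian.

```python
import numpy as np

def coboundary(n_agents, edges):
    """Build the toy coboundary matrix.

    Each edge is (u, v, s); its row computes x[v] - s * x[u],
    which is zero exactly when the edge's constraint is satisfied.
    """
    delta = np.zeros((len(edges), n_agents))
    for row, (u, v, s) in enumerate(edges):
        delta[row, u] = -s
        delta[row, v] = 1.0
    return delta

def coherence_defect(x, edges):
    """Total squared constraint violation of the assignment x."""
    delta = coboundary(len(x), edges)
    return float(np.sum((delta @ x) ** 2))

def min_defect(n_agents, edges):
    """Smallest eigenvalue of L = delta^T delta.

    This is the least defect any unit-norm assignment can achieve;
    it is zero only if some nonzero assignment satisfies every edge.
    """
    delta = coboundary(n_agents, edges)
    return float(np.linalg.eigvalsh(delta.T @ delta)[0])

# Assumed toy data: a triangle of agents 0, 1, 2.
balanced = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, 1.0)]
frustrated = [(0, 1, 1.0), (1, 2, 1.0), (2, 0, -1.0)]

x = np.array([1.0, 1.0, 1.0])
print(coherence_defect(x, balanced))    # all three constraints hold
print(coherence_defect(x, frustrated))  # the sign-flipped edge is violated

print(min_defect(3, balanced))    # numerically zero: a global section exists
print(min_defect(3, frustrated))  # strictly positive: no consistent assignment

# Each edge of the frustrated triangle is satisfiable on its own,
# so checking pairs one at a time reports nothing wrong.
print([min_defect(3, [e]) for e in frustrated])
```

In this toy model the frustrated triangle passes every single-edge check yet admits no assignment that satisfies all three edges at once, which is the kind of conflict the abstract describes as invisible to pairwise statistics.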
