This paper asks a simple question about modern AI systems: when you put several models or agents on a task, do their outputs actually cohere into one trustworthy answer — and if not, where, and how much? Standard tooling checks only that the components communicated, not that they agreed. We show that this is, formally, the problem sheaf theory was built for — how local pieces of information glue into a consistent global whole — and we turn that mathematics into a family of computable consistency measures for the outputs of N agents: a fast combinatorial H⁰ disagreement measure, a sheaf-Laplacian "coherence defect," a relational H¹ that detects cyclically frustrated disagreement no pairwise check can see, and a distributional contextuality test. We evaluate them honestly. On graded, relational, and distributional data the measures clearly beat cheap baselines — tracking conflict severity, respecting confidence, localizing the disagreement, and catching cyclic frustration (AUC 1.00) that is invisible to any pairwise statistic. On a discrete multiple-choice task run on a live panel of language models, coherence predicts wrong answers but does not beat simple vote-counting. We report this negative as plainly as the positives: the construction earns its keep on graded, relational, and distributional problems — not one-of-N voting. Code and data are released for reproducibility.
Jack Widman (Fri,) studied this question.
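To make two of the measures named in the abstract concrete, here is a minimal sketch, not the paper's released code. It assumes the simplest possible setting: each agent reports one scalar score, restriction maps are identities (so the sheaf Laplacian reduces to a confidence-weighted graph Laplacian), and relational constraints are plain "agree"/"differ" signs on edges. The function names `coherence_defect`, `edge_contributions`, and `cycle_is_frustrated` are hypothetical and chosen only for illustration.

```python
def coherence_defect(values, edges):
    """Weighted sum of squared disagreements over edges.

    Assumption: identity restriction maps, so this equals x^T L x for a
    weighted graph Laplacian L (a special case of a sheaf Laplacian).
    values: dict mapping agent name -> scalar score
    edges:  list of (u, v, weight); weight stands in for confidence
    """
    return sum(w * (values[u] - values[v]) ** 2 for u, v, w in edges)


def edge_contributions(values, edges):
    """Per-edge terms of the defect, to localize where disagreement sits."""
    return {(u, v): w * (values[u] - values[v]) ** 2 for u, v, w in edges}


def cycle_is_frustrated(signs):
    """signs: +1 ("should agree") or -1 ("should differ") per edge of a cycle.

    Each edge is satisfiable on its own, but when the product of signs
    around the cycle is -1 no binary assignment satisfies all of them.
    """
    product = 1
    for s in signs:
        product *= s
    return product == -1


# Illustrative inputs (made up for this sketch, not taken from the paper).
values = {"a": 0.9, "b": 0.9, "c": 0.1}
edges = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 0.5)]

defect = coherence_defect(values, edges)
worst_edge = max(edge_contributions(values, edges).items(), key=lambda kv: kv[1])[0]
frustrated = cycle_is_frustrated([-1, -1, -1])  # three "differ" edges on a triangle
```

In this toy setting the defect is zero exactly when all connected agents report the same score, the per-edge terms point to the edge carrying the conflict, and the triangle of three "differ" constraints is frustrated even though every individual edge can be satisfied. The paper's actual constructions (general restriction maps, the relational H¹, and the distributional contextuality test) are richer than this sketch.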