What question did this study set out to answer?

This paper examines whether agreement among AI systems is a reliable indicator of truth, and shows that consensus without an external check can be stably wrong.

June 26, 2026 · Open Access

GENESIS R90.9 — Ground Truth between Emergence and Externality

Key Points

Eight AI models worked on the same task, received the same critique, and were asked to revise.
The study assessed whether the models converged on a proposed fix and whether anyone verified that the fix works.
Small dynamical-systems models (differential equations) were used to test whether an external correctness check is necessary.
Seven models converged on the same proposed fix, but only one actually built and tested it, showing that agreement alone does not indicate accuracy.
A robust 'truth' state requires external verification beyond model consensus; otherwise coherent fiction can be mistaken for actual truth.
The paper's own AI-assembled literature appendix propagated incorrect citations and included one fabricated reference, reinforcing the need for external checks.
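
The dynamical-systems point can be made concrete with a small sketch. This is an illustrative toy model, not the equations used in R90.9: the variable `x`, the consensus gain `k`, the external check rate `a`, and the functions `belief_rate` and `simulate` are all assumptions introduced here. It shows a self-reinforcing consensus that settles on whichever side starts with the majority, plus an external check that only ever pulls toward the true state.

```python
# Toy model only: NOT the equations used in R90.9.
# x in [0, 1] is the share of the group's belief sitting on the true state;
# x = 0 is a fully coherent fiction, x = 0.5 is the tipping point.

def belief_rate(x, k, a):
    """dx/dt = k*x*(1-x)*(x-0.5) + a*(1-x).

    The first term is self-reinforcing consensus (hypothetical gain k):
    whichever side holds the majority grows. The second term is an external
    check at hypothetical rate a that only ever pulls toward the truth.
    """
    return k * x * (1.0 - x) * (x - 0.5) + a * (1.0 - x)


def simulate(x0, k=1.0, a=0.0, dt=0.01, steps=20_000):
    """Integrate the toy ODE with forward Euler and return the final x."""
    x = x0
    for _ in range(steps):
        x += dt * belief_rate(x, k, a)
        x = min(1.0, max(0.0, x))  # keep the belief share in [0, 1]
    return x


if __name__ == "__main__":
    start = 0.2  # the group starts out mostly believing the fiction
    for a in (0.0, 0.03, 0.10):
        final = simulate(start, a=a)
        print(f"external check rate a={a:.2f} -> final x={final:.3f}")
```

With no external check (`a = 0`) the group locks into whichever state it started closer to, so a start at `x = 0.2` ends in stable fiction. A weak check leaves it stuck near the fiction, and only a sufficiently strong check lets it escape to the true state. In this sketch the escape condition works out to `a > k/16`; that threshold is a property of the toy equation and should not be read as the paper's reported minimum rate of about one third.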

Abstract

When several AI systems independently arrive at the same answer, it is tempting to treat that agreement as proof. This working paper shows why that intuition misleads, and what it would take to fix it. R90.9 is a short “bridge” report within the GENESIS research programme, an ongoing study of multi-agent AI systems.

At its centre is a small but pointed experiment: eight different AI models worked on the same task, were then given the same critique, and asked to revise. Seven converged on the same proposed fix, but only one actually built it and checked whether it works. The agreement of the other seven was an echo of the shared critique, not seven independent confirmations. (In a second round the pattern held; even the auditing node briefly fell for a too-clean version of it.)

The deeper point: a group of AI systems that agree only among themselves can be stably wrong. To tell stable truth from stable, internally coherent fiction, you need a check from outside the language models: a comparison against the real world (the paper’s “A-axis”). Without that external anchor, a confident, self-consistent fiction is indistinguishable from the truth.

The paper tests this using small dynamical-systems models (differential equations). The result: only when a self-reinforcing mechanism makes the “truth” state robust enough, and an external correctness check is added on top, does the system reliably reject coherent fiction while holding genuine truth. This becomes the seed model for the next study (R91.x), together with a testable minimum rate for how much external checking is needed (about one third).

A reflexive finding closes the loop: the literature appendix was itself assembled by eight AI agents, and it reproduced exactly the failure it warns about. The swarm converged on the well-known, propagated a few incorrect citations, and missed the single most on-point paper; one agent even fabricated a non-existent reference. The paper treats this as evidence for its own thesis: agreement is not validation, not even your own.

Why it matters: as AI systems increasingly evaluate, judge, and correct one another (LLM-as-judge, multi-agent pipelines), the caution grows more important. More agreeing models do not mean more truth. R90.9 argues concretely for keeping a non-AI, real-world check in the loop.

Status: working paper / synthetic-theory exploration; methods and code reproducible. A bridge from R90.8 to R91.x.

Fürste, Dietmar (AI2AIR.Vibe.Lab / AI2AIR.Competence.Lab, Oldenburg, Germany), Role: HITL Principal & Epistemic Governor.

Tiny Team AI2AIR.Lab: Claude (Anthropic) · Claude-B (Anthropic, Cross-Auditor) · ChatGPT (OpenAI) · DeepSeek · Gemini (Google) · Grok (xAI) · Kimi (Moonshot AI) · Perplexity · Vibe (Mistral)
