When several AI systems independently arrive at the same answer, it is tempting to treat that agreement as proof. This working paper shows why that intuition misleads — and what it would take to fix it. R90.9 is a short “bridge” report within the GENESIS research programme, an ongoing study of multi-agent AI systems. At its centre is a small but pointed experiment: eight different AI models worked on the same task, were then given the same critique, and asked to revise. Seven converged on the same proposed fix — but only one actually built it and checked whether it works. The agreement of the other seven was an echo of the shared critique, not seven independent confirmations. (In a second round the pattern held; even the auditing node briefly fell for a too-clean version of it.) The deeper point: a group of AI systems that agree only among themselves can be stably wrong. To tell stable truth from stable, internally coherent fiction, you need a check from outside the language models — a comparison against the real world (the paper’s “A-axis”). Without that external anchor, a confident, self-consistent fiction is indistinguishable from the truth. The paper tests this using small dynamical-systems models (differential equations). The result: only when a self-reinforcing mechanism makes the “truth” state robust enough, and an external correctness check is added on top, does the system reliably reject coherent fiction while holding genuine truth. This becomes the seed model for the next study (R91.x), together with a testable minimum rate for how much external checking is needed (about one third). A reflexive finding closes the loop: the literature appendix was itself assembled by eight AI agents — and reproduced exactly the failure it warns about. The swarm converged on the well-known, propagated a few incorrect citations, and missed the single most on-point paper; one agent even fabricated a non-existent reference. The paper treats this as evidence for its own thesis: agreement is not validation — not even your own. Why it matters: as AI systems increasingly evaluate, judge, and correct one another (LLM-as-judge, multi-agent pipelines), the caution grows more important — more agreeing models do not mean more truth. R90.9 argues concretely for keeping a non-AI, real-world check in the loop. Status: working paper / synthetic-theory exploration; methods and code reproducible. A bridge from R90.8 to R91.x. Fürste, Dietmar (AI2AIR.Vibe.Lab / AI2AIR.Competence.Lab, Oldenburg, Germany) — Role: HITL Principal & Epistemic Governor. Tiny Team AI2AIR.Lab: Claude (Anthropic) · Claude-B (Anthropic, Cross-Auditor) · ChatGPT (OpenAI) · DeepSeek · Gemini (Google) · Grok (xAI) · Kimi (Moonshot AI) · Perplexity · Vibe (Mistral)
Dietmar Fuerste (Tue,) studied this question.