v2: Minor formatting corrections to manuscript file. Reference list reordered to citation sequence. Figure 1 updated to vector format. No changes to data, results, or scientific content.

AI systems achieve 93% accuracy on formal reasoning benchmarks, yet the fastest-growing uses of AI concern indeterminate futures: market movements, sports outcomes, medical symptoms. Here we show, across three epistemically distinct problem domains and 7,950 individually scored data points using six frontier language models evaluated as research subjects, that AI reliability degrades sharply and predictably as question type shifts from formal to indeterminate. The governing variable is ρ̂, the pairwise error correlation between ensemble components. Compute scaling yields ρ̂ = 0.80; GAAS role-separation (Generator–Auditor–Adversary–Synthesizer) reduces this to ρ̂ = 0.19, a four-fold improvement on identical compute. Formal-domain accuracy reaches 93.0% single-agent and 98.7% with GAAS architecture. Semi-determinate expert synthesis falls to 79.2%, with causal-hierarchy errors detected in 13 of 18 cross-domain evaluations. Indeterminate futures reach only 66.0%, with a calibration inversion: the model with the highest point-estimate accuracy (Gemini: 5.3% mean error) simultaneously achieves the second-lowest confidence-interval calibration (29.7% hit rate against a 90% target, a −60 pp gap). Intelligence and reliability are empirically dissociated in this dataset precisely where AI is deployed most consequentially. Architectural role-separation, not compute scaling, is the mechanism that bridges the gap.
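The two quantities the abstract relies on, pairwise error correlation and confidence-interval hit rate, can be illustrated with a minimal sketch. The abstract does not give the exact estimator for ρ̂, so this assumes it is the mean Pearson correlation between binary error indicators of each pair of ensemble components; the function names and input layout are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def mean_pairwise_error_correlation(errors):
    """Assumed form of rho-hat: average correlation over all component pairs.

    errors: one list per ensemble component, aligned by question,
            with 1 = component answered wrongly, 0 = answered correctly.
    """
    return mean(pearson(a, b) for a, b in combinations(errors, 2))


def interval_hit_rate(intervals, outcomes):
    """Fraction of realized outcomes that fall inside the stated (low, high) interval."""
    hits = sum(low <= y <= high for (low, high), y in zip(intervals, outcomes))
    return hits / len(outcomes)


def calibration_gap_pp(hit_rate, target=0.90):
    """Gap between observed hit rate and nominal coverage, in percentage points."""
    return (hit_rate - target) * 100
```

Under this reading, components that make the same mistakes on the same questions push the correlation toward 1, so adding more of them contributes little independent evidence. Applying `calibration_gap_pp` to the abstract's reported 29.7% hit rate against a 90% target gives a gap of about −60 pp, which matches the stated figure.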
Kuldeep Kumar Pandit
Vatsala Pandit
Aayan Pandit
Pandit et al. studied this question.
www.synapsesocial.com/papers/69ba44654e9516ffd37a60dc — DOI: https://doi.org/10.5281/zenodo.19051035