Artificial intelligence systems now mediate an estimated 2.8 billion human interactions per day, which makes their reliability a matter of societal infrastructure. Yet the mechanism that governs reliability above the current frontier (approximately 93% single-agent accuracy on complex reasoning tasks) remains uncharacterised. Here we report a three-study empirical programme spanning 7,950 individually scored data points, 5,004 evaluation sessions, and six frontier AI models across three fundamentally distinct problem domains. The governing variable is the error correlation ρ̂ between ensemble components, which can be estimated from observed performance. Compute scaling yields ρ̂ = 0.80; architectural role-separation (Generator–Auditor–Adversary–Synthesizer; GAAS) reduces this to ρ̂ = 0.19, a roughly four-fold improvement (77% reduction in error correlation) on identical compute. A calibration inversion in indeterminate domains, in which the most accurate model simultaneously achieves the second-worst confidence-interval calibration, demonstrates that intelligence and reliability are empirically orthogonal. Together, the GAAS framework and a ρ̂ estimator constitute a deployable architectural specification for high-reliability AI.
Pandit et al. (Tue,) studied this question.
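The abstract describes ρ̂ as an error correlation between ensemble components that can be measured from observed performance, but it does not give the estimator itself. One plausible reading, offered here only as a minimal sketch, is the mean pairwise Pearson correlation of binary error indicators (1 if a component answered an item incorrectly, 0 otherwise). The function names, the choice of Pearson correlation, and the toy data below are assumptions made for illustration; they are not taken from the paper.

```python
from itertools import combinations
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences.

    For 0/1 error indicators this is the phi coefficient.
    """
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A component that is always right (or always wrong) has no variance.
        raise ValueError("correlation undefined: a component has constant errors")
    return cov / sqrt(var_x * var_y)


def mean_pairwise_error_correlation(errors):
    """Average the pairwise correlations over all component pairs.

    errors[i][j] is 1 if component i was wrong on item j, else 0.
    This averaging scheme is an assumed estimator, not the paper's.
    """
    pairs = list(combinations(errors, 2))
    if not pairs:
        raise ValueError("need at least two components")
    return mean(pearson(a, b) for a, b in pairs)


# Hypothetical toy data: three components scored on eight items.
component_a = [1, 1, 0, 0, 0, 0, 0, 0]
component_b = [1, 1, 0, 0, 0, 0, 0, 0]  # fails on the same items as A
component_c = [0, 0, 1, 1, 0, 0, 0, 0]  # fails on different items

rho_hat = mean_pairwise_error_correlation([component_a, component_b, component_c])
print(f"rho_hat = {rho_hat:.3f}")
```

In this toy case components A and B fail on exactly the same items, so their pairwise correlation is 1, while C fails elsewhere and pulls the average down. That is the intuition behind preferring components whose errors do not coincide: an ensemble can only correct an error that at least some of its members avoid.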