This preprint reports results from 4,680 controlled evaluations across six frontier AI models, four experimental scenarios, six mathematical reasoning domains, and 90 contamination-minimised, formally verifiable problems. The central finding is an architectural phase transition in AI reliability: above approximately 95% accuracy, error structure undergoes a qualitative shift from stochastic to systematic, at which point compute scaling becomes fundamentally ineffective.

The Generator–Auditor–Adversary–Synthesizer (GAAS) architecture breaks the single-agent ceiling: single-agent inference plateaus at 93.0%; self-consistency scaling yields only +1.5 pp (p=0.317, NS); role-separated GAAS achieves 98.7% (p<0.001); role-specialised model diversity achieves 100% on this evaluation set (Wilson CI: 95.9–100%).

This is Empirical Study 1 of a planned three-study programme. Study 1 covers determinate domains (mathematical and logical reasoning). Studies 2 and 3 will extend to semi-determinate and indeterminate domains.

Files included:
- Main manuscript (PDF)
- Online Appendix 1: All 90 evaluation questions, verified ground-truth answers, complete raw model responses for S1–S4, error taxonomy classifications, and experimental protocols (DOCX)
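The Wilson interval quoted for the 100% result can be reproduced with a short calculation. The sketch below assumes the interval was computed over the 90 evaluation problems at a 95% confidence level; the abstract does not state either explicitly. The function name `wilson_interval` is illustrative and is not taken from the manuscript.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile (an assumption here).
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")

    p_hat = successes / n
    z2 = z * z
    denom = 1.0 + z2 / n
    centre = (p_hat + z2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z2 / (4 * n * n)) / denom

    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumed sample: 90 correct out of 90 problems.
lower, upper = wilson_interval(90, 90)
print(f"{lower:.1%} to {upper:.1%}")
```

With 90 of 90 correct, the lower bound reduces to n / (n + z²) = 90 / 93.84, or about 95.9%, which matches the figure in the abstract. Unlike the normal-approximation (Wald) interval, the Wilson interval does not collapse to zero width when the observed proportion is exactly 100%.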
Kuldeep Kumar Pandit
Vatsala Pandit
Aayan Pandit
Pandit et al. studied this question.
www.synapsesocial.com/papers/69b8f11edeb47d591b8c5ec4 — DOI: https://doi.org/10.5281/zenodo.19036412