What question did this study set out to answer?

This study investigates how architectural design affects AI reliability, focusing on a phase transition in error structure, from stochastic to systematic, that appears above approximately 95% accuracy.

March 17, 2026 | Open Access

AI Reliability Programme — Empirical Study 1: Architectural Phase Transition Governs AI Reliability Beyond the Single-Agent Ceiling (Pandit et al., 2026)

Key points

Conducted 4,680 evaluations across six AI models and four experimental scenarios.
Focused on mathematical reasoning domains with 90 verified problems.
Analyzed the performance of the GAAS architecture against traditional single-agent models.
Identified a crucial phase transition in AI reliability at approximately 95% accuracy.
The GAAS architecture achieved 98.7% accuracy, surpassing the single-agent plateau of 93.0%.
Role-specialised model diversity reached 100% accuracy on the evaluation set.

Abstract

This preprint reports results from 4,680 controlled evaluations across six frontier AI models, four experimental scenarios, six mathematical reasoning domains, and 90 contamination-minimised, formally verifiable problems. The central finding is an architectural phase transition in AI reliability: above approximately 95% accuracy, error structure undergoes a qualitative shift from stochastic to systematic, at which point compute scaling becomes fundamentally ineffective. The Generator–Auditor–Adversary–Synthesizer (GAAS) architecture breaks the single-agent ceiling: single-agent inference plateaus at 93.0%; self-consistency scaling yields only +1.5 pp (p=0.317, NS); role-separated GAAS achieves 98.7% (p<0.001); role-specialised model diversity achieves 100% on this evaluation set (Wilson CI: 95.9–100%).

This is Empirical Study 1 of a planned three-study programme. Study 1 covers determinate domains (mathematical and logical reasoning). Studies 2 and 3 will extend to semi-determinate and indeterminate domains.

Files included:
- Main manuscript (PDF)
- Online Appendix 1: All 90 evaluation questions, verified ground-truth answers, complete raw model responses for S1–S4, error taxonomy classifications, and experimental protocols (DOCX)
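The Wilson interval quoted for the 100% result can be checked with a short calculation. The sketch below assumes the interval was computed from 90 correct answers out of 90 problems at the 95% level; the abstract does not state the denominator explicitly, so that is an assumption, and the function name `wilson_interval` is illustrative rather than taken from the study.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959964) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z defaults to the two-sided 95% normal quantile.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")

    p_hat = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p_hat + z2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / trials + z2 / (4 * trials**2)) / denom

    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Assumed inputs: 90 correct out of 90 problems (not stated explicitly in the abstract).
low, high = wilson_interval(90, 90)
print(f"{low:.1%} to {high:.1%}")
```

Under that assumption the lower bound comes out near 95.9%, matching the reported interval. This illustrates why a perfect score on 90 items still leaves room for a true error rate of a few percent.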
