What question did this study set out to answer?

This study examines how the accuracy of AI systems varies across three reasoning domains: formal, semi-determinate, and indeterminate.

March 18, 2026 | Open Access

The Reliability Chasm: AI Accuracy Across Three Reasoning Domains

Key Points

Evaluated six language models on three reasoning domains with 7,950 scored data points.
Applied the GAAS (Generator–Auditor–Adversary–Synthesizer) role-separation architecture to assess its effect on error correlation and accuracy.
Conducted cross-domain evaluations for error detection and reliability comparisons.
AI achieves 93% accuracy in formal reasoning tasks but only 66% in indeterminate futures.
GAAS role-separation lowers pairwise error correlation to 0.19, versus 0.80 under compute scaling, on identical compute.
Causal-hierarchy errors were detected in 13 of 18 cross-domain evaluations in the semi-determinate domain, where accuracy falls to 79.2%.
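
The error-correlation figure in the key points can be illustrated with a small sketch. This page does not state how the paper computes ρ̂, so the code below assumes one plausible reading: the Pearson correlation of binary error indicators (1 = wrong, 0 = right), averaged over all pairs of ensemble components. The function names and the toy data are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant error vector")
    return cov / (vx * vy) ** 0.5


def mean_pairwise_error_correlation(errors):
    """Average Pearson correlation over all pairs of components.

    `errors` maps a component name to a list of 0/1 error indicators,
    one per question, in the same question order for every component.
    This is an assumed definition of rho-hat, not the paper's own.
    """
    pairs = combinations(errors.values(), 2)
    return mean(pearson(a, b) for a, b in pairs)


# Toy data (made up): three components answering four questions.
toy_errors = {
    "component_a": [1, 1, 0, 0],
    "component_b": [1, 0, 1, 0],
    "component_c": [1, 1, 0, 0],
}
rho_hat = mean_pairwise_error_correlation(toy_errors)
```

Under this reading, components that fail on the same questions push the value toward 1, while components that fail independently push it toward 0, which is the sense in which a lower value means a more useful ensemble.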

Abstract

v2: Minor formatting corrections to manuscript file. Reference list reordered to citation sequence. Figure 1 updated to vector format. No changes to data, results, or scientific content.

AI systems achieve 93% accuracy on formal reasoning benchmarks, yet the fastest-growing uses of AI concern indeterminate futures: market movements, sports outcomes, medical symptoms. Here we show, across three epistemically distinct problem domains and 7,950 individually scored data points using six frontier language models evaluated as research subjects, that AI reliability degrades sharply and predictably as question type shifts from formal to indeterminate. The governing variable is ρ̂, the pairwise error correlation between ensemble components. Compute scaling yields ρ̂ = 0.80; GAAS role-separation (Generator–Auditor–Adversary–Synthesizer) reduces this to ρ̂ = 0.19, a four-fold improvement on identical compute. Formal-domain accuracy reaches 93.0% single-agent and 98.7% with GAAS architecture. Semi-determinate expert synthesis falls to 79.2%, with causal-hierarchy errors detected in 13 of 18 cross-domain evaluations. Indeterminate futures reach only 66.0%, with a calibration inversion: the highest point-estimate accuracy model (Gemini: 5.3% mean error) simultaneously achieves the second-lowest confidence-interval calibration (29.7% hit rate against a 90% target, a −60 pp gap). Intelligence and reliability are empirically dissociated in this dataset precisely where AI is deployed most consequentially. Architectural role-separation, not compute scaling, is the mechanism that bridges the gap.
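
The calibration gap quoted in the abstract is simple arithmetic, sketched below. The assumption here is that the "hit rate" is the fraction of stated confidence intervals that contain the realized outcome; the function names and the example intervals are hypothetical, and only the 29.7% and 90% figures come from the abstract.

```python
def interval_hit_rate(intervals, outcomes):
    """Fraction of (low, high) intervals that contain the realized outcome."""
    hits = sum(1 for (low, high), y in zip(intervals, outcomes) if low <= y <= high)
    return hits / len(outcomes)


def calibration_gap_pp(hit_rate, target=0.90):
    """Gap between observed hit rate and nominal coverage, in percentage points."""
    return (hit_rate - target) * 100


# Made-up example: three forecasts with stated intervals and realized values.
toy_rate = interval_hit_rate([(0, 10), (5, 6), (1, 2)], [5, 7, 1.5])

# Figures from the abstract: 29.7% hit rate against a 90% target.
gap = calibration_gap_pp(0.297, target=0.90)
```

With the abstract's figures the gap is about −60.3 percentage points, which matches the rounded −60 pp reported there. A well-calibrated 90% interval would give a gap near zero.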
