What question did this study set out to answer?

To investigate how self-audit calibration influences architectural effectiveness in semi-determinate AI domains.

March 17, 2026 | Open Access

AI Reliability Programme — Empirical Study 2: Self-Audit Calibration Governs Architectural Effectiveness in Semi-Determinate Domains (Pandit et al., 2026)

Key Points

Investigated how self-audit calibration influences architectural effectiveness in semi-determinate AI domains
Conducted 180 controlled evaluations across six AI models
Performed three independent evaluation runs
Analyzed ten cross-domain reasoning problems across eight scientific disciplines
Applied the Kruskal-Wallis test for model-level differentiation
Observed a 14-percentage-point reliability downturn from the formal domains of Study 1 (93.0%) to semi-determinate domains (79.2%)
Identified a reliable core of problems (Q3, Q6, Q7, Q8) scored at 100% across all 18 evaluations
Established self-audit calibration as a critical independent variable in architectural design
Found significant model-level differentiation (Kruskal-Wallis H = 12.779, p = 0.026)
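
The design counts and the Kruskal-Wallis result listed above can be sanity-checked with a few lines of arithmetic. The sketch below is not the authors' analysis code. It assumes the test compared the six models as groups (so 5 degrees of freedom) and that the p-value comes from the usual chi-square approximation; neither detail is stated in the summary. The helper name `chi2_sf_odd_df` is hypothetical and uses the closed-form chi-square upper tail that exists for odd degrees of freedom.

```python
import math


def chi2_sf_odd_df(x: float, df: int) -> float:
    """Upper-tail probability of a chi-square variable, closed form for odd df."""
    if x < 0:
        raise ValueError("the statistic must be non-negative")
    if df < 1 or df % 2 == 0:
        raise ValueError("this closed form only covers odd degrees of freedom")
    chi = math.sqrt(x)
    # Standard normal density evaluated at chi.
    z = math.exp(-x / 2.0) / math.sqrt(2.0 * math.pi)
    # Sum of chi^(2r-1) / (1*3*...*(2r-1)) for r = 1 .. (df-1)/2.
    total, term = 0.0, chi
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    # erfc(chi / sqrt(2)) equals twice the upper normal tail at chi.
    return math.erfc(chi / math.sqrt(2.0)) + 2.0 * z * total


# Figures taken from the summary.
N_MODELS, N_RUNS, N_PROBLEMS = 6, 3, 10
H_REPORTED = 12.779

total_evaluations = N_MODELS * N_RUNS * N_PROBLEMS   # evaluations overall
evaluations_per_problem = N_MODELS * N_RUNS          # evaluations per problem
downturn_points = 93.0 - 79.2                        # Study 1 minus Study 2

# Assumption: the groups are the six models, so df = k - 1.
df = N_MODELS - 1
p_value = chi2_sf_odd_df(H_REPORTED, df)

print(f"evaluations: {total_evaluations}, per problem: {evaluations_per_problem}")
print(f"downturn: {downturn_points:.1f} percentage points")
print(f"Kruskal-Wallis df = {df}, approximate p = {p_value:.4f}")
```

Under these assumptions the counts match the reported 180 evaluations and 18 evaluations per problem, and the approximate p-value should land close to the reported 0.026.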

Abstract

This preprint reports 180 controlled evaluations across six frontier AI models (Claude Sonnet 4.5, ChatGPT GPT-5.2, Grok 4/4.1, DeepSeek V2.5, Gemini 3, Perplexity), three independent evaluation runs, and ten cross-domain reasoning problems spanning eight scientific disciplines (applied physics, limnology, clinical biostatistics, cardiovascular physiology, evolutionary biology, atmospheric chemistry, Bayesian statistics, evolutionary anthropology, fluid dynamics, and game theory/ecology).

The central finding is a 14-percentage-point reliability downturn relative to Study 1 formal domains (79.2% vs 93.0%), placing semi-determinate performance in Regime 2 (mixed stochastic-systematic error structure). Performance stratifies into a reliable core (Q3, Q6, Q7, Q8: 100% across all 18 evaluations) and a variable periphery determined by depth of mechanistic recall required. Critically, self-audit calibration (Spearman ρₛ) emerges as an independent architectural design variable: only Claude (ρₛ = 0.903) and Gemini (ρₛ = 0.703) meet the proposed Auditor-quality threshold. Kruskal-Wallis confirms significant model-level differentiation (H = 12.779, p = 0.026).

This is Empirical Study 2 of a planned three-study programme. Study 1 covered formal determinate domains (mathematical reasoning). Study 3 will extend to fully indeterminate domains.

Files included:
- Main manuscript (PDF)
- Supplementary Data S2: Complete 180-evaluation dataset, all model responses across 3 runs, self-audit scores, logic trail analysis, inter-rater reliability data, and scoring rubrics (PDF)
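
As a rough illustration of the calibration statistic named in the abstract, the sketch below computes a Spearman rank correlation in plain Python. It assumes ρₛ is taken between a model's self-audit scores and its externally assigned rubric scores, which the abstract does not spell out. The function names and the ten score pairs are hypothetical and are not drawn from the study's dataset.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for one model on ten problems (illustration only).
self_audit_scores = [9, 7, 10, 6, 8, 10, 10, 10, 5, 7]
rubric_scores = [8, 6, 10, 7, 7, 10, 10, 10, 4, 8]

rho = spearman_rho(self_audit_scores, rubric_scores)
print(f"self-audit calibration (Spearman rho) = {rho:.3f}")
```

A value near 1 would mean the model's self-assessments rank its answers in nearly the same order as the external scoring. The numeric Auditor-quality threshold is not given in this summary, so the sketch stops at computing the coefficient.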
