This preprint reports 180 controlled evaluations: six frontier AI models (Claude Sonnet 4.5, ChatGPT GPT-5.2, Grok 4/4.1, DeepSeek V2.5, Gemini 3, Perplexity), three independent evaluation runs, and ten cross-domain reasoning problems spanning eight scientific disciplines (applied physics, limnology, clinical biostatistics, cardiovascular physiology, evolutionary biology, atmospheric chemistry, Bayesian statistics, evolutionary anthropology, fluid dynamics, and game theory/ecology). The central finding is a reliability drop of about 14 percentage points relative to the formal domains of Study 1 (79.2% vs 93.0%), which places semi-determinate performance in Regime 2 (mixed stochastic-systematic error structure). Performance stratifies into a reliable core (Q3, Q6, Q7, Q8: 100% across all 18 evaluations) and a variable periphery determined by the depth of mechanistic recall required. Critically, self-audit calibration (Spearman ρₛ) emerges as an independent architectural design variable: only Claude (ρₛ = 0.903) and Gemini (ρₛ = 0.703) meet the proposed Auditor-quality threshold. A Kruskal-Wallis test confirms significant model-level differentiation (H = 12.779, p = 0.026).

This is Empirical Study 2 of a planned three-study programme. Study 1 covered formal determinate domains (mathematical reasoning). Study 3 will extend to fully indeterminate domains.

Files included:
- Main manuscript (PDF)
- Supplementary Data S2: complete 180-evaluation dataset, all model responses across 3 runs, self-audit scores, logic trail analysis, inter-rater reliability data, and scoring rubrics (PDF)
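The arithmetic behind the abstract's headline numbers can be checked with a short Python sketch. It enumerates the models × runs × problems design, recomputes the gap between the Study 1 and Study 2 reliability figures, and shows one common way to compute a Spearman rank correlation. The function names, the reading of "18 evaluations" as six models times three runs per question, and the idea that ρₛ compares self-audit scores with graded scores are assumptions made for illustration; the manuscript's actual procedure and data are not reproduced here.

```python
from itertools import product

# Model names as listed in the abstract; run and problem counts likewise.
MODELS = ["Claude Sonnet 4.5", "ChatGPT GPT-5.2", "Grok 4/4.1",
          "DeepSeek V2.5", "Gemini 3", "Perplexity"]
RUNS = 3
PROBLEMS = 10


def design_cells(models, runs, problems):
    """Enumerate every (model, run, question) cell of a full factorial design."""
    return [(m, r, f"Q{q}")
            for m, r, q in product(models, range(1, runs + 1),
                                   range(1, problems + 1))]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


cells = design_cells(MODELS, RUNS, PROBLEMS)
total_evaluations = len(cells)
# Assumed reading: each question is evaluated once per model per run.
evaluations_per_question = sum(1 for _, _, q in cells if q == "Q3")
# Gap between the Study 1 and Study 2 reliability figures, in percentage points.
gap_points = round(93.0 - 79.2, 1)

print(total_evaluations, evaluations_per_question, gap_points)
```

Under these assumptions the design yields 180 cells in total and 18 per question, and the reliability gap comes to 13.8 percentage points, which the abstract rounds to 14. For real analyses, a maintained implementation such as `scipy.stats.spearmanr` would be the more usual choice than the hand-rolled `spearman_rho` shown here.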
Kuldeep Kumar Pandit
Vatsala Pandit
Aayan Pandit
Pandit et al. studied this question.
www.synapsesocial.com/papers/69b8f10fdeb47d591b8c5df7 — DOI: https://doi.org/10.5281/zenodo.19037248