What question did this study set out to answer?

This research aims to determine if structural behavioral analysis can differentiate between successful and unsuccessful AI agent sessions without analyzing code or text.

March 16, 2026 | Open Access

Structural Behavioral Auditing Discriminates Successful from Failed AI Agent Sessions: A 1,000-Session Real-World Validation

Key Points

Evaluation of 1,000 sessions from the SWE-agent framework.
Comparison of 500 resolved sessions to 500 unresolved sessions.
Use of CAUM's zero-training, deterministic scoring engine for ranking.
Cross-model validation on GPT-4o sessions to test generalization (AUC = 0.676).
Cohen's d of 0.977 indicates a large effect size between resolved and unresolved sessions.
AUC of 0.758 means structural behavior alone correctly ranks a random resolved/unresolved session pair 75.8% of the time.
Identified a behavioral ceiling at AUC ~0.83 through machine learning ensemble experiments.
Predominance of EXPLORER regime in resolved sessions (56%) and GRIND regime in failed sessions (58%).

Abstract

We evaluate whether purely structural behavioral analysis, without reading agent-generated code or text, can distinguish successful from failed AI agent coding sessions. Using 1,000 real sessions from the SWE-agent framework on SWE-bench (500 resolved, 500 unresolved), we show that CAUM's zero-training, deterministic scoring engine achieves Cohen's d = 0.977 and AUC = 0.758, meaning structural behavior alone correctly ranks a random resolved/unresolved session pair 75.8% of the time. Cross-model validation on GPT-4o sessions confirms generalization (AUC = 0.676). We identify a behavioral ceiling at AUC ~0.83 via an ML ensemble experiment. Resolved sessions are predominantly in the EXPLORER regime (56%), while failed sessions are predominantly in the GRIND regime (58%). All results were achieved with zero training data and fully deterministic scoring. Related work: 'The Competence Illusion' (doi:10.5281/zenodo.18928199).
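The two headline statistics can be made concrete with a small sketch. The code below is not CAUM's scoring engine, and the function names and toy scores are hypothetical, chosen only for illustration. It shows how Cohen's d and a pairwise-ranking AUC could be computed from any per-session score, and how the two relate if one assumes both score distributions are normal with equal variance, in which case AUC = Φ(d/√2).

```python
import math
from statistics import mean, variance


def cohens_d(pos, neg):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(pos), len(neg)
    pooled_var = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    return (mean(pos) - mean(neg)) / math.sqrt(pooled_var)


def pairwise_auc(pos, neg):
    """Fraction of (pos, neg) pairs in which pos scores higher; ties count half."""
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def auc_from_d(d):
    """AUC implied by Cohen's d under an equal-variance normal assumption.

    Phi(d / sqrt(2)) written with erf: Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    """
    return 0.5 * (1.0 + math.erf(d / 2.0))


# Hypothetical per-session scores (NOT data from the study).
resolved = [0.9, 0.7, 0.6, 0.4]
unresolved = [0.8, 0.5, 0.3, 0.2]

if __name__ == "__main__":
    print("Cohen's d (toy data):", round(cohens_d(resolved, unresolved), 3))
    print("Pairwise AUC (toy data):", pairwise_auc(resolved, unresolved))
    print("AUC implied by d = 0.977:", round(auc_from_d(0.977), 3))
```

Under that normality assumption, d = 0.977 implies an AUC of roughly 0.755, which is close to the reported 0.758. This suggests the two reported figures are mutually consistent, though the study's actual score distributions need not be normal.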
