We evaluate whether purely structural behavioral analysis, which never reads agent-generated code or text, can distinguish successful from failed AI agent coding sessions. Using 1,000 real sessions from the SWE-agent framework on SWE-bench (500 resolved, 500 unresolved), we show that CAUM's zero-training, deterministic scoring engine achieves Cohen's d = 0.977 and AUC = 0.758, meaning that structural behavior alone correctly ranks a random resolved/unresolved session pair 75.8% of the time. Cross-model validation on GPT-4o sessions shows that the signal generalizes, with reduced discrimination (AUC = 0.676). An ML ensemble experiment indicates a behavioral ceiling at AUC ≈ 0.83. Resolved sessions fall predominantly in the EXPLORER regime (56%), while failed sessions fall predominantly in the GRIND regime (58%). All results were achieved with zero training data and fully deterministic scoring. Related work: 'The Competence Illusion' (doi:10.5281/zenodo.18928199).
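To make the two headline metrics concrete, here is a minimal Python sketch of how Cohen's d and a pairwise-ranking AUC can be computed from per-session scores. The function names and the toy scores are hypothetical illustrations, not CAUM's scoring engine or data. The helper `auc_from_d` assumes normally distributed scores with equal variance in both groups, which the abstract does not claim.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def pairwise_auc(pos, neg):
    """Fraction of (pos, neg) pairs where pos scores higher; ties count as half."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def auc_from_d(d):
    """AUC implied by d under the ASSUMPTION of equal-variance normal scores.

    AUC = Phi(d / sqrt(2)), and Phi(x) = 0.5 * (1 + erf(x / sqrt(2))),
    so the expression simplifies to 0.5 * (1 + erf(d / 2)).
    """
    return 0.5 * (1.0 + math.erf(d / 2.0))


# Hypothetical structural scores (illustration only, not real session data).
resolved = [0.9, 0.7, 0.8, 0.4]
unresolved = [0.3, 0.6, 0.5, 0.2]

print("Cohen's d:", cohens_d(resolved, unresolved))
print("pairwise AUC:", pairwise_auc(resolved, unresolved))
print("AUC implied by d = 0.977:", auc_from_d(0.977))
```

Under the equal-variance normal assumption, d = 0.977 implies an AUC of about 0.755, close to the reported 0.758, so the two figures are mutually consistent.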
Andres Ricardo Silva Gasca studied this question.