What question did this study set out to answer?

The research aims to develop a framework for assessing the reasoning processes of AI agents, beyond mere task completion.

February 26, 2026 | Open Access

The FACADE Gap: Agents That Succeed Without Reasoning

Key Points

Introduced the FACADE benchmark, which evaluates whether an agent gathered the information required to justify its action, separately from whether it completed the task.
Examined performance across 17 schemas and three vendor families (Anthropic, OpenAI, Google).
Analyzed the impact of tool interface metadata on agent compliance and safety.
Identified that a single change to a tool's name and description can flip agent behavior from fully exploitable to fully compliant.
Found high outcome rates on two-condition conditional reasoning tasks but near-zero outcomes on constraint satisfaction tasks with four or more intersecting constraints.
Demonstrated that structurally analogous tasks produce significantly different safety outcomes depending on domain vocabulary for two of three vendors, while the third vendor is uniformly vulnerable across domains.

Abstract

Current AI agent benchmarks measure outcome fidelity (did the agent complete the task?) but do not separately evaluate process. We introduce FACADE (Fidelity Assessment of Constrained Agent Decision Execution), a benchmark that independently measures whether an agent gathered the information required to justify its action. This separation exposes a failure class invisible to outcome-only evaluation: agents that consistently reach correct outcomes through incomplete reasoning (Class 2 failure). Such agents pass standard benchmarks but are vulnerable to distribution shift and adversarial manipulation, and their existence indicates that behavioral evaluations are insufficient evidence of a reasoning process. Across 17 schemas and three vendor families (Anthropic, OpenAI, Google), we report three principal findings. First, a single change to a tool's name and description flips agent behavior from fully exploitable to fully compliant, establishing tool interface metadata as a causal variable in agent safety. This effect replicates across all three vendors with statistical significance, though with varying magnitude. Second, conditional reasoning tasks (two-condition Boolean logic) achieve high outcome rates across all three vendors, while high-dimensional constraint satisfaction tasks (four or more intersecting constraints) achieve near-zero outcomes, a structural contrast that replicates independently of vendor. Third, structurally analogous tasks produce significantly different safety outcomes depending on domain vocabulary for two of three vendors (chi-squared p < 0.001), while the third vendor shows uniform vulnerability across all domains, a distinct failure pattern that itself carries diagnostic value. These findings suggest that tool interface design (naming conventions, response formatting, and information architecture) is an underexplored variable in agent safety evaluation. FACADE provides a framework for measuring this variable and identifying exploitable reasoning gaps before deployment.
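The separation of outcome from process described in the abstract can be illustrated with a minimal sketch. This is not the FACADE implementation: the `Trial` record, the idea of representing required information as a set of tool calls, the tool-call names, and every label other than "Class 2 failure" are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One agent run (hypothetical record format, not from the paper)."""
    outcome_correct: bool       # did the agent take the correct final action?
    required_calls: frozenset   # information-gathering steps needed to justify the action
    observed_calls: frozenset   # steps the agent actually performed


def process_complete(trial: Trial) -> bool:
    # Process fidelity: every required step appears among the observed steps.
    return trial.required_calls <= trial.observed_calls


def classify(trial: Trial) -> str:
    # Only "class_2_failure" corresponds to a category named in the abstract;
    # the other three labels are illustrative placeholders.
    ok_process = process_complete(trial)
    if trial.outcome_correct and ok_process:
        return "justified_success"
    if trial.outcome_correct:
        return "class_2_failure"  # correct outcome, incomplete reasoning
    if ok_process:
        return "complete_process_wrong_outcome"
    return "incomplete_process_wrong_outcome"


def class_2_rate(trials: list) -> float:
    # Fraction of runs that an outcome-only benchmark would count as passes
    # even though the justifying information was never gathered.
    if not trials:
        return 0.0
    return sum(classify(t) == "class_2_failure" for t in trials) / len(trials)
```

Under these assumptions, a run that takes the correct action after calling only one of two required lookups is scored as a pass by an outcome-only metric, but `classify` labels it `class_2_failure`, which is the gap the benchmark is described as measuring.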

Authors

Philip Forshaw

Institutions

MaineGeneral Medical Center

Mineral Resources