Current AI agent benchmarks measure outcome fidelity (did the agent complete the task?) but do not separately evaluate process. We introduce FACADE (Fidelity Assessment of Constrained Agent Decision Execution), a benchmark that independently measures whether an agent gathered the information required to justify its action. This separation exposes a failure class invisible to outcome-only evaluation: agents that consistently reach correct outcomes through incomplete reasoning (Class 2 failure). Such agents pass standard benchmarks but remain vulnerable to distribution shift and adversarial manipulation, and their existence indicates that behavioral evaluations alone are insufficient evidence of the underlying reasoning process. Across 17 schemas and three vendor families (Anthropic, OpenAI, Google), we report three principal findings. First, a single change to a tool's name and description flips agent behavior from fully exploitable to fully compliant, establishing tool interface metadata as a causal variable in agent safety; the effect replicates across all three vendors with statistical significance, though with varying magnitude. Second, conditional reasoning tasks (two-condition Boolean logic) achieve high outcome rates across all three vendors, while high-dimensional constraint satisfaction tasks (four or more intersecting constraints) achieve near-zero outcome rates, a structural contrast that holds regardless of vendor. Third, structurally analogous tasks produce significantly different safety outcomes depending on domain vocabulary for two of three vendors (chi-squared p < 0.001), while the third vendor shows uniform vulnerability across all domains, a distinct failure pattern that itself carries diagnostic value. These findings suggest that tool interface design (naming conventions, response formatting, and information architecture) is an underexplored variable in agent safety evaluation. FACADE provides a framework for measuring this variable and identifying exploitable reasoning gaps before deployment.
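To make the outcome/process separation concrete, the following is a minimal sketch of how an episode might be scored on both axes. It is not FACADE's actual implementation: the `Episode` type, its field names, the set-inclusion test for process completeness, and every label other than the Class 2 failure named in the abstract are assumptions introduced here for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run (hypothetical structure, not taken from the paper)."""
    required_info: frozenset[str]  # assumed: facts needed to justify the action
    gathered_info: frozenset[str]  # assumed: facts the agent actually retrieved
    outcome_correct: bool          # did the agent reach the correct outcome?


def process_complete(ep: Episode) -> bool:
    # Assumption: the process counts as complete when every required fact
    # was gathered before acting (a plain subset check).
    return ep.required_info <= ep.gathered_info


def classify(ep: Episode) -> str:
    # Only "class_2_failure" corresponds to a term used in the abstract;
    # the other labels are illustrative placeholders.
    if ep.outcome_correct and process_complete(ep):
        return "justified_success"
    if ep.outcome_correct:
        return "class_2_failure"  # correct outcome, incomplete reasoning
    if process_complete(ep):
        return "wrong_outcome_complete_process"
    return "wrong_outcome_incomplete_process"


# Hypothetical episode: the agent acted correctly without checking one fact.
lucky = Episode(
    required_info=frozenset({"account_status", "approval_flag"}),
    gathered_info=frozenset({"account_status"}),
    outcome_correct=True,
)
print(classify(lucky))
```

An outcome-only benchmark would count the `lucky` episode as a pass; scoring the process separately is what surfaces it as a Class 2 failure.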
Philip Forshaw (Tue,) studied this question.
synapsesocial.com/papers/699fe2eb95ddcd3a253e66e4 — DOI: https://doi.org/10.5281/zenodo.18762668
Philip Forshaw
MaineGeneral Medical Center
Mineral Resources