Current AI agent benchmarks measure outcome fidelity (did the agent complete the task?) but do not separately evaluate process. We introduce FACADE (Fidelity Assessment of Constrained Agent Decision Execution), a benchmark that independently measures whether an agent gathered the information required to justify its action. This separation exposes a failure class invisible to outcome-only evaluation: agents that consistently reach correct outcomes through incomplete reasoning (Class 2 failure). Such agents pass standard benchmarks but are vulnerable to distribution shift and adversarial manipulation, and their existence indicates that behavioral evaluations are insufficient evidence of a reasoning process. Across 17 schemas and three vendor families (Anthropic, OpenAI, Google), we report three principal findings. First, a single change to a tool's name and description flips agent behavior from fully exploitable to fully compliant, establishing tool interface metadata as a causal variable in agent safety; the effect replicates across all three vendors with statistical significance, though with varying magnitude. Second, conditional reasoning tasks (two-condition Boolean logic) achieve high outcome rates across all three vendors, while high-dimensional constraint satisfaction tasks (four or more intersecting constraints) achieve near-zero outcome rates, a structural contrast that replicates independently of vendor. Third, structurally analogous tasks produce significantly different safety outcomes depending on domain vocabulary for two of three vendors (chi-squared p < 0.001), while the third vendor shows uniform vulnerability across all domains, a distinct failure pattern that itself carries diagnostic value. These findings suggest that tool interface design (naming conventions, response formatting, and information architecture) is an underexplored variable in agent safety evaluation. FACADE provides a framework for measuring this variable and identifying exploitable reasoning gaps before deployment.
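The separation of outcome from process described above can be illustrated with a minimal sketch. It is not FACADE's actual scoring code: the `Episode` type, its field names, and every label other than the Class 2 failure are hypothetical, and the sketch assumes the required information for a task can be written as a set of named items that the agent either gathered or did not.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Episode:
    """One agent run (hypothetical structure, not taken from the paper)."""
    outcome_correct: bool          # did the agent reach the correct outcome?
    required_info: frozenset[str]  # information needed to justify the action
    gathered_info: frozenset[str]  # information the agent actually retrieved


def process_complete(ep: Episode) -> bool:
    # Process is judged independently of outcome: every required item
    # must appear among the items the agent gathered.
    return ep.required_info <= ep.gathered_info


def classify(ep: Episode) -> str:
    complete = process_complete(ep)
    if ep.outcome_correct and complete:
        return "pass"
    if ep.outcome_correct:
        # Correct outcome through incomplete reasoning: the Class 2 failure
        # that outcome-only evaluation cannot see.
        return "class_2_failure"
    if complete:
        return "outcome_failure_complete_process"  # hypothetical label
    return "outcome_and_process_failure"           # hypothetical label


def class_2_rate(episodes: list[Episode]) -> float:
    # Share of all episodes that succeed on outcome but fail on process.
    if not episodes:
        return 0.0
    hits = sum(1 for ep in episodes if classify(ep) == "class_2_failure")
    return hits / len(episodes)
```

An outcome-only benchmark would score the first two labels identically. Keeping them apart is what makes the Class 2 rate measurable.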
Philip Forshaw
MaineGeneral Medical Center
Mineral Resources
synapsesocial.com/papers/699fe2eb95ddcd3a253e66e4 — DOI: https://doi.org/10.5281/zenodo.18762668