Agentic evaluation systems increasingly combine language models with tools, retrieval, gates, and repeated audit loops. This paper studies what happens when the information boundary is engineered rather than merely inspected after the fact. It presents a clean 12-cell citation-verification matrix crossing three models with four conditions: a minimal no-tool prompt, a structured no-tool prompt, and two forced evidence-use agent runs. The design separates model-visible evidence from scoring-only ground truth before evaluation and uses deterministic recomputation scripts to audit the resulting matrix. Across 960 primary trials, minimal no-tool conditions were weak or degenerate by balanced discrimination, and structured prompting was strongly model-dependent. Forced evidence-use loops produced the most consistent positive band: across three models and two agent runs, Youden's J clustered between +0.325 and +0.397. A separate 960-trial full-matrix rerun supports the stability interpretation: all six forced-tool cells remained within |Delta J| < 0.06 of the original matrix, while the largest structured-prompt drift reached |Delta J| = 0.162. Item-hardness analysis showed a bimodal pattern, with 48 citations easy, 28 hard, and 4 ambiguous under the admitted metadata evidence boundary. The paper contributes a constructive counterpart to the specification-boundary framework introduced in the companion paper "Why Sense Matters." It argues that forced evidence-use loops do not make citation verification generally solved; rather, they make the evaluated object more explicit, auditable, and reproducible within a defined information boundary. The release package includes the manuscript, figures, 960 primary raw trial files, 960 rerun trial files, matrix summaries, audit reports, provenance files, and scripts for recomputing the reported results without live API calls.
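To make the reported metric concrete, the following is a minimal sketch of how Youden's J and the rerun drift |Delta J| could be computed from per-cell confusion counts. It uses the standard definition J = sensitivity + specificity - 1. The function names, the 40/40 split of valid and invalid citations, and all counts are hypothetical illustrations; none of them are taken from the paper's release scripts or data.

```python
def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1, from confusion counts."""
    positives = tp + fn  # items whose ground-truth label is positive
    negatives = tn + fp  # items whose ground-truth label is negative
    if positives == 0 or negatives == 0:
        raise ValueError("J is undefined unless both classes are present")
    sensitivity = tp / positives
    specificity = tn / negatives
    return sensitivity + specificity - 1.0


def j_drift(j_original: float, j_rerun: float) -> float:
    """Absolute change in J between an original cell and its rerun."""
    return abs(j_rerun - j_original)


# Hypothetical counts for one 80-item cell (assumed 40 positive, 40 negative).
j_original = youden_j(tp=30, fn=10, tn=24, fp=16)
j_rerun = youden_j(tp=28, fn=12, tn=24, fp=16)
drift = j_drift(j_original, j_rerun)

# A model that labels every item positive has sensitivity 1 and specificity 0,
# so J is 0 even though raw accuracy on a balanced set would be 50%.
j_degenerate = youden_j(tp=40, fn=0, tn=0, fp=40)
```

Because J weights both classes equally, a cell that accepts or rejects every citation scores zero however high its raw accuracy looks, which is the sense in which a condition can be degenerate by balanced discrimination.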
Jianeng Zhou (Wed,) studied this question.