What question did this study set out to answer?

This paper studies what happens when the information boundary in an agentic citation-verification system is engineered in advance rather than merely inspected after the fact.

June 19, 2026 | Open Access

How Sense Works: Specification-Boundary Engineering for Agentic Citation Verification

Key Points

The study built a 12-cell citation-verification matrix crossing three models with four conditions.
It ran 960 primary trials covering a minimal no-tool prompt, a structured no-tool prompt, and two forced evidence-use agent runs.
Model-visible evidence was separated from scoring-only ground truth before evaluation, and deterministic recomputation scripts were used to audit the resulting matrix.
Minimal no-tool conditions discriminated weakly or degenerately, with Youden's J values below 0.325.
Structured prompting was strongly model-dependent.
Forced evidence-use loops gave the most consistent results: across three models and two agent runs, Youden's J clustered between +0.325 and +0.397.
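
The key points report discrimination as Youden's J, which is sensitivity plus specificity minus one. The sketch below shows that standard formula computed from confusion-matrix counts. The function name, the argument names, and the choice of which class counts as "positive" (here, a citation that should be accepted) are assumptions for illustration and are not taken from the paper's release scripts.

```python
def youdens_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1, from confusion counts.

    Assumption: "positive" means a citation that should be accepted.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative item")
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


# A verifier that accepts every citation is degenerate: it has perfect
# sensitivity, zero specificity, and therefore J = 0.
print(youdens_j(tp=40, fn=0, tn=0, fp=40))  # 0.0
```

J ranges from -1 to +1, and a value of 0 means the verdicts carry no more information than a constant answer, which is why a cell can score reasonable raw accuracy and still be degenerate by this measure.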

Abstract

Agentic evaluation systems increasingly combine language models with tools, retrieval, gates, and repeated audit loops. This paper studies what happens when the information boundary is engineered rather than merely inspected after the fact. It presents a clean 12-cell citation-verification matrix crossing three models with four conditions: a minimal no-tool prompt, a structured no-tool prompt, and two forced evidence-use agent runs. The design separates model-visible evidence from scoring-only ground truth before evaluation and uses deterministic recomputation scripts to audit the resulting matrix. Across 960 primary trials, minimal no-tool conditions were weak or degenerate by balanced discrimination, and structured prompting was strongly model-dependent. Forced evidence-use loops produced the most consistent positive band: across three models and two agent runs, Youden's J clustered between +0.325 and +0.397. A separate 960-trial full-matrix rerun supports the stability interpretation: all six forced-tool cells remained within |Delta J| < 0.06 of the original matrix, while the largest structured-prompt drift reached |Delta J| = 0.162. Item-hardness analysis showed a bimodal pattern, with 48 citations easy, 28 hard, and 4 ambiguous under the admitted metadata evidence boundary. The paper contributes a constructive counterpart to the specification-boundary framework introduced in the companion paper "Why Sense Matters." It argues that forced evidence-use loops do not make citation verification generally solved; rather, they make the evaluated object more explicit, auditable, and reproducible within a defined information boundary. The release package includes the manuscript, figures, 960 primary raw trial files, 960 rerun trial files, matrix summaries, audit reports, provenance files, and scripts for recomputing the reported results without live API calls.
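
The matrix layout and the rerun stability check described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's recomputation code: the model and condition labels, the function names, and the example J values are hypothetical placeholders. The per-cell count of 80 is an inference from the abstract (960 trials over 12 cells, matching the 48 + 28 + 4 = 80 citations in the item-hardness analysis), not a figure the abstract states directly.

```python
from itertools import product

# Hypothetical labels; the abstract does not name the models.
MODELS = ["model_a", "model_b", "model_c"]
CONDITIONS = [
    "minimal_no_tool",
    "structured_no_tool",
    "forced_tool_run_1",
    "forced_tool_run_2",
]

cells = list(product(MODELS, CONDITIONS))  # 3 models x 4 conditions = 12 cells
trials_per_cell = 960 // len(cells)        # 80, inferred from the abstract


def max_drift(original: dict, rerun: dict, keys) -> float:
    """Largest |Delta J| between two matrices over the given cells."""
    return max(abs(rerun[k] - original[k]) for k in keys)


def is_stable(original: dict, rerun: dict, keys, tol: float = 0.06) -> bool:
    """True if every listed cell stays within |Delta J| < tol."""
    return max_drift(original, rerun, keys) < tol


# Select the six forced-tool cells (three models x two agent runs).
forced_cells = [c for c in cells if c[1].startswith("forced_tool")]

# Placeholder J values for illustration only, NOT results from the paper.
original = {c: 0.35 for c in forced_cells}
rerun = {c: 0.38 for c in forced_cells}
print(len(cells), trials_per_cell, is_stable(original, rerun, forced_cells))
```

Restricting the check to a chosen set of cells mirrors how the abstract reports stability separately for the forced-tool cells and for the structured-prompt cells.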

Cite This Study

Jianeng Zhou studied this question.

synapsesocial.com/papers/6a34de7065a5b0777af2de06 https://doi.org/10.5281/zenodo.20725397
