What question did this study set out to answer?

This paper explores the validity boundaries in agentic evaluation workflows, particularly in citation verification tasks.

June 18, 2026 | Open Access

Why Sense Matters: Answer-Key Leakage and the Specification Boundary in Agentic Citation Verification

Key points

The paper examines validity boundaries in agentic evaluation workflows, focusing on citation verification tasks.
The argument is developed through a diagnostic case in LLM-based citation verification.
A multi-round audit recomputed metrics, checked trial counts, and inspected scripts.
The decisive validity threat was that model-visible reference records contained scoring-only answer labels.
Model-visible references were then separated from scoring-only ground truth.
After the separation, DeepSeek V3 GATE dropped from Youden's J = +0.659 to +0.015.
Grok GATE dropped from +0.842 to +0.105.
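
For readers unfamiliar with the metric quoted above, Youden's J is sensitivity plus specificity minus one, so a value near 0 means the verifier does no better than chance and a value near 1 means near-perfect separation. The following is a minimal sketch of that standard definition. The function name and the confusion-matrix counts are hypothetical and are not taken from the paper's trial files.

```python
def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1, from confusion-matrix counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative trial")
    sensitivity = tp / (tp + fn)  # share of erroneous citations that were flagged
    specificity = tn / (tn + fp)  # share of correct citations that were passed
    return sensitivity + specificity - 1.0


# Hypothetical counts, for illustration only (not the paper's data).
perfect = youden_j(tp=40, fn=0, tn=40, fp=0)     # every trial classified correctly
chance = youden_j(tp=20, fn=20, tn=20, fp=20)    # coin-flip behaviour
```

Read this way, a drop from +0.659 to +0.015 moves a condition from clearly informative to roughly chance level.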

Abstract

Agentic evaluation workflows increasingly rely on loops: models call tools, retrieve references, receive feedback, pass through gates, and are checked by other agents or scripts. Such loops can verify many specified properties, but they do not define their own validity boundary. This paper argues that agentic evaluation is specification-bound before it is metric-bound: metrics are interpretable only after the evaluated object has been correctly specified. The argument is developed through a diagnostic case in LLM-based citation verification. A multi-round audit process recomputed metrics, checked trial counts, inspected scripts, and found several local inconsistencies, yet initially missed the decisive validity threat: model-visible reference records contained scoring-only answer labels, including has_error and error_type. After model-visible references were separated from scoring-only ground truth, apparent GATE advantages collapsed: DeepSeek V3 GATE dropped from Youden's J = +0.659 to +0.015, and Grok GATE dropped from +0.842 to +0.105. A clean six-cell rerun is reported only as secondary diagnostic evidence, not as a model-ranking benchmark. The paper introduces the specification-boundary framework and distinguishes metric audit from condition-definition audit. Metric audit asks whether reported numbers follow from trial files. Condition-definition audit asks whether prompts, tools, files, memories, retrieval contexts, and reference stores instantiate the intended experimental condition. The prior capacity to identify validity threats, define information boundaries, and translate them into auditable constraints is called anticipatory specification judgment (ASJ). In agentic evaluation, asking whether a result "makes sense" is asking whether the underlying specification admits a coherent evaluable object.
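
One way to picture a condition-definition audit is as a check that nothing on the model-visible side carries a scoring-only field. The sketch below is an illustration under stated assumptions, not the paper's actual tooling. The field names has_error and error_type come from the abstract, but treating them as the complete scoring-only set, along with the record layout and the function names, is hypothetical.

```python
from typing import Any

# Field names from the abstract; assuming these two are the whole
# scoring-only set is a simplification made for this sketch.
SCORING_ONLY_FIELDS = frozenset({"has_error", "error_type"})


def split_record(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (model_visible, scoring_only) views of one reference record."""
    visible = {k: v for k, v in record.items() if k not in SCORING_ONLY_FIELDS}
    hidden = {k: v for k, v in record.items() if k in SCORING_ONLY_FIELDS}
    return visible, hidden


def find_leaks(model_visible_records: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """List (record index, field name) for each scoring-only field the model can see."""
    return [
        (i, field)
        for i, rec in enumerate(model_visible_records)
        for field in sorted(SCORING_ONLY_FIELDS.intersection(rec))
    ]


# Hypothetical reference record, for illustration only.
raw = {"title": "Example Paper", "year": 2020,
       "has_error": True, "error_type": "wrong_year"}
visible, hidden = split_record(raw)
leaks_before = find_leaks([raw])      # the unsplit record exposes both labels
leaks_after = find_leaks([visible])   # the split record exposes none
```

A metric audit would pass on either version of the record, since the numbers still follow from the trial files. Only a check on what the model can actually read distinguishes the two conditions.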
