Agentic evaluation workflows increasingly rely on loops: models call tools, retrieve references, receive feedback, pass through gates, and are checked by other agents or scripts. Such loops can verify many specified properties, but they do not define their own validity boundary. This paper argues that agentic evaluation is specification-bound before it is metric-bound: metrics are interpretable only after the evaluated object has been correctly specified. The argument is developed through a diagnostic case in LLM-based citation verification. A multi-round audit process recomputed metrics, checked trial counts, inspected scripts, and found several local inconsistencies, yet initially missed the decisive validity threat: model-visible reference records contained scoring-only answer labels, including has_error and error_type. After model-visible references were separated from scoring-only ground truth, apparent GATE advantages collapsed: DeepSeek V3 GATE dropped from Youden's J = +0.659 to +0.015, and Grok GATE dropped from +0.842 to +0.105. A clean six-cell rerun is reported only as secondary diagnostic evidence, not as a model-ranking benchmark. The paper introduces the specification-boundary framework and distinguishes metric audit from condition-definition audit. Metric audit asks whether reported numbers follow from trial files. Condition-definition audit asks whether prompts, tools, files, memories, retrieval contexts, and reference stores instantiate the intended experimental condition. The prior capacity to identify validity threats, define information boundaries, and translate them into auditable constraints is called anticipatory specification judgment (ASJ). In agentic evaluation, asking whether a result "makes sense" is asking whether the underlying specification admits a coherent evaluable object.
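To make the distinction between the two audits concrete, the following is a minimal Python sketch and not the paper's actual pipeline. It assumes reference records are flat dictionaries and that has_error and error_type are the only scoring-only fields (the abstract says the labels "include" these two, so the real set may be larger). The function names split_record, find_leaks, and youden_j are hypothetical, as is the sample record. Youden's J is computed with its standard definition, sensitivity plus specificity minus one.

```python
from typing import Any

# Assumption: the two field names come from the abstract; treating them as the
# complete set of scoring-only labels is an illustrative simplification.
SCORING_ONLY_FIELDS = frozenset({"has_error", "error_type"})


def split_record(record: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split one flat record into (model-visible part, scoring-only part)."""
    visible = {k: v for k, v in record.items() if k not in SCORING_ONLY_FIELDS}
    truth = {k: v for k, v in record.items() if k in SCORING_ONLY_FIELDS}
    return visible, truth


def find_leaks(model_visible_records: list[dict[str, Any]]) -> list[tuple[int, str]]:
    """Condition-definition check: list (record index, field name) pairs where
    a scoring-only field appears in what the model can see.
    Only top-level keys are inspected; nested structures are not searched."""
    leaks = []
    for index, record in enumerate(model_visible_records):
        for field in sorted(record.keys() & SCORING_ONLY_FIELDS):
            leaks.append((index, field))
    return leaks


def youden_j(tp: int, fn: int, tn: int, fp: int) -> float:
    """Youden's J = sensitivity + specificity - 1 (standard definition)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1.0


if __name__ == "__main__":
    # Hypothetical record, invented purely for illustration.
    raw = {"title": "Example Paper", "year": 2020,
           "has_error": True, "error_type": "wrong_year"}
    visible, truth = split_record(raw)
    print(find_leaks([raw]))      # leaked fields before separation
    print(find_leaks([visible]))  # no leaks after separation
```

A metric audit would only recheck the output of something like youden_j against the trial files. A condition-definition audit corresponds to running a check like find_leaks on everything the model can read before any metric is trusted.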
Jianeng Zhou studied this question.