What question did this study set out to answer?

The aim is to improve AI agent evaluation by formalizing consistency across multiple dimensions instead of relying solely on pass rate.

May 21, 2026 · Open Access

Formalizing Consistency in AI Agent Evaluation: Suite v0

Key Points

Developed Suite v0, a 50-task benchmark for AI agents, based on a five-axis reproducibility hierarchy.
Implemented evaluations on an AI agent (Zeus) using four consistency axes and a cross-family critic constraint.
Conducted assessments to compare pass rates against artifact stability and scorer agreement.
62% of tasks pass yet produce structurally different artifacts on each rerun.
Two scorers disagreed on identical artifacts, indicating evaluator pathology.
Prompt augmentations that raised pass rate from 88% to 96% concurrently degraded held-out generalization.
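
The scorer-disagreement finding can be illustrated with a minimal sketch. It assumes each scorer returns a boolean verdict per artifact; the function name `scorer_disagreement` and the dictionary layout are hypothetical and are not taken from Suite v0.

```python
def scorer_disagreement(
    scores_a: dict[str, bool], scores_b: dict[str, bool]
) -> tuple[float, list[str]]:
    """Return the disagreement rate and the artifact ids the scorers dispute.

    Hypothetical layout: each dict maps an artifact id to one scorer's
    pass/fail verdict. Only artifacts scored by both scorers are compared.
    """
    shared = sorted(scores_a.keys() & scores_b.keys())
    if not shared:
        raise ValueError("no artifacts were scored by both scorers")
    disputed = [k for k in shared if scores_a[k] != scores_b[k]]
    return len(disputed) / len(shared), disputed


# Toy verdicts, invented for illustration only.
rate, disputed = scorer_disagreement(
    {"x": True, "y": True, "z": False, "w": True},
    {"x": True, "y": False, "z": False, "w": False},
)
print(rate, disputed)
```

A non-zero rate on identical artifacts points at the evaluators rather than the agent, which is why it is worth reporting alongside pass rate.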

Abstract

Public AI agent benchmarks report a single scalar: pass rate. We argue this is a lossy projection of five orthogonal axes onto one dimension, and that the missing axes have first-order consequences for production deployments. We formalize consistency as a five-axis hierarchy of reproducibility measures: exact, syntactic, lexical, behavioral, and decisional. We prove a monotonicity relation and an optimization tradeoff theorem showing that raising pass rate provides no guarantee of non-degradation in orthogonal axes. We demonstrate this formalization with Suite v0: a 50-task, $0/night, MIT-licensed benchmark instrumented with four consistency axes and a cross-family critic constraint. On a real AI agent (Zeus), a single evaluation run surfaces: (1) 62% of tasks pass yet produce structurally different artifacts on each rerun; (2) two scorers disagree on the same artifact (evaluator pathology); (3) prompt augmentations raising pass rate from 88% to 96% concurrently degrade held-out generalization; and (4) client-side determinism (temperature=0, RNG seeding) is insufficient: rerun instability worsens slightly while pass rate improves. Suite v0 is reproducible in five commands on commodity hardware at zero cost.
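
To make the gap between pass rate and rerun stability concrete, here is a minimal sketch that reports both for a set of reruns. It assumes artifacts are JSON strings and uses a hypothetical structural signature (keys and value types, with leaf values dropped); the paper's actual axis definitions are not given in this summary, so `shape` and `summarize` are illustrative names only.

```python
import json
from typing import Any


def shape(value: Any) -> Any:
    """Hypothetical structural signature: keeps keys and types, drops leaf values."""
    if isinstance(value, dict):
        # Dict keys are unique, so sorting never has to compare the nested shapes.
        return tuple(sorted((k, shape(v)) for k, v in value.items()))
    if isinstance(value, list):
        return ("list", tuple(shape(v) for v in value))
    return type(value).__name__


def summarize(runs: dict[str, list[tuple[bool, str]]]) -> dict[str, float]:
    """runs maps a task id to one (passed, artifact_json) pair per rerun."""
    passing = 0
    passing_unstable = 0
    for reruns in runs.values():
        if all(passed for passed, _ in reruns):
            passing += 1
            shapes = {shape(json.loads(artifact)) for _, artifact in reruns}
            if len(shapes) > 1:  # passes every time, but structure differs
                passing_unstable += 1
    n = len(runs)
    return {"pass_rate": passing / n, "pass_but_unstable": passing_unstable / n}


# Toy reruns, invented for illustration only.
toy_runs = {
    "t1": [(True, '{"a": 1}'), (True, '{"a": 2}')],
    "t2": [(True, '{"a": 1}'), (True, '{"b": [1]}')],
    "t3": [(False, "{}"), (True, "{}")],
    "t4": [(True, "[1, 2]"), (True, "[1, 2, 3]")],
}
print(summarize(toy_runs))
```

In this toy data, task `t1` changes only leaf values and counts as stable, while `t2` and `t4` pass on every rerun yet differ in structure, which is the kind of case a single pass-rate scalar hides.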

Cite This Study

Atakan Akbaba studied this question.

synapsesocial.com/papers/6a0ea196be05d6e3efb6077f https://doi.org/10.5281/zenodo.20285100
