Public AI agent benchmarks report a single scalar: pass rate. We argue this is a lossy projection of five orthogonal axes onto one dimension, and that the missingaxes have first-order consequences for production deployments. We formalize consistency as a five-axis hierarchy of reproducibility measures: exact, syntactic, lexical, behavioral, and decisional. We prove a monotonicity relation and an optimization tradeoff theorem showing that raising pass rate provides no guarantee of non-degradation in orthogonal axes. We demonstrate this formalization with Suite v0: a 50-task, 0/night, MIT-licensed benchmark instrumented with four consistency axes and a cross-family criticconstraint. On a real AI agent (Zeus), a single evaluation run surfaces: (1) 62% of tasks pass yet produce structurally different artifacts on each rerun; (2) twoscorers disagree on the same artifact (evaluator pathology) ; (3) prompt augmentations raising pass rate 88% to 96% concurrently degrade held-out generalization; and (4) client-side determinism (temperature=0, RNG seeding) is insufficient — rerun instability worsens slightly while pass rate improves. Suite v0 is reproducible in five commands on commodity hardware at zero cost.
Atakan Akbaba (Tue,) studied this question.