• High-fidelity iteration bottleneck: Near-identical variants demand priority-setting to preserve design intent and commit to a refinement direction.
• Repeated paired evaluation protocol: We introduce a repeated paired evaluation protocol (100 repeated runs per pair) to test whether AI assessments remain decision-reliable under near-equivalence rather than merely plausible in a single run.
• Core finding: Across repeated paired comparisons of near-identical images, VLMs produce stable evaluations yet fail to converge on a consistent judgment (a directional, action-guiding preference), whereas professional designers reliably make such judgments and justify them by prioritizing decisive concerns that establish a clear next-step design direction.
• Why it matters (AIGC + innovation workflows): In AIGC-heavy innovation workflows with many plausible variants, progress toward design excellence depends on sensitivity to subtle distinctions and the evaluative ability to decide what matters most now.
• Implication for AI-supported tools: Current VLM evaluations can reproduce evaluative language and criteria, but they do not reliably translate those assessments into a stable, action-guiding direction under near-equivalence, limiting their usefulness for late-stage selection and iteration.

Comparing visually near-identical design alternatives is a recurring challenge in late-stage refinement, where progress often depends on subtle distinctions in perceived quality, coherence, or intent. As AI-generated imagery accelerates the production of high-fidelity variants, this raises a practice-facing question: can vision–language models (VLMs) support the kind of comparative judgment designers rely on to choose a direction? We examine VLM evaluation under near-equivalence using repeated paired comparison. Three open-source VLMs assessed eight near-identical design pairs over 100 repeated runs per pair.
We analyze rating stability, paired score differences and preference consistency across runs, and cross-model convergence. A complementary study with 13 professional designers provides contrast in how experts establish direction under the same conditions. Results show that while models apply evaluative criteria consistently, their comparative outcomes converge toward equivalence under repetition and do not yield a stable direction for selection or next-step change. Human experts, by contrast, consistently make directional judgments by prioritizing decisive concerns and articulating actionable refinement rationales. The study distinguishes fluent evaluation from decision-relevant judgment and clarifies a boundary for current AI-supported design evaluation in contexts where iteration and commitment are central to design practice.
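The three per-pair quantities named above (rating stability, paired score differences, and preference consistency across runs) can be illustrated with a minimal Python sketch. This is not the authors' analysis code. The function name `summarize_pair`, the `tie_margin` parameter, the use of population standard deviation as a stability measure, and the example scores are all assumptions introduced here for illustration; the source does not specify how each metric was computed.

```python
from statistics import mean, pstdev


def summarize_pair(scores_a, scores_b, tie_margin=0.0):
    """Summarize repeated runs for one design pair (variant A vs. variant B).

    scores_a[i] and scores_b[i] are the ratings a model gave to the two
    variants in run i. All metric definitions are illustrative assumptions.
    """
    if not scores_a or len(scores_a) != len(scores_b):
        raise ValueError("need two non-empty score lists of equal length")

    n = len(scores_a)
    # Paired score difference per run: positive means A was rated higher.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    # A run counts as a preference only if the gap exceeds the tie margin.
    wins_a = sum(d > tie_margin for d in diffs)
    wins_b = sum(d < -tie_margin for d in diffs)
    ties = n - wins_a - wins_b

    return {
        # Rating stability: spread of each variant's scores across runs.
        "stability_a": pstdev(scores_a),
        "stability_b": pstdev(scores_b),
        # Mean paired difference: near zero suggests equivalence.
        "mean_diff": mean(diffs),
        # Preference consistency: share of runs favoring the majority side.
        "consistency": max(wins_a, wins_b) / n,
        "tie_rate": ties / n,
    }


# Hypothetical scores for four runs (the study used 100 runs per pair).
stable_but_undecided = summarize_pair([7, 7, 8, 7], [7, 8, 7, 7])
clearly_directional = summarize_pair([8, 8, 8, 8], [6, 6, 6, 6])
print(stable_but_undecided)
print(clearly_directional)
```

Under these assumed definitions, a pair can show low score spread (stable ratings) while its mean difference sits near zero and neither side wins a majority of runs, which is the pattern the abstract describes as stable evaluation without a directional preference.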
Sun et al. (Fri,) studied this question.