What question did this study set out to answer?

The study investigates how vision-language models assess near-identical design variants and their ability to provide reliable directional judgments.

May 18, 2026 · Open Access

Stability Without Sensitivity: Evaluative Description and Directional Judgment in Vision–Language Models

Key Points

Three open-source vision-language models were evaluated across eight design pairs, with 100 repeated runs per pair.
Rating stability and preference consistency in the AI assessments were analyzed and contrasted with professional designer evaluations.
A complementary evaluation with 13 professional designers showed how experts establish direction under the same conditions.
Models produced stable evaluations but lacked consistent directional judgment across repetitions.
Human designers made reliable directional judgments by prioritizing decisive design aspects.
AI evaluations demonstrated consistent application of criteria but struggled to provide actionable next-step recommendations.

Abstract

• High-fidelity iteration bottleneck: Near-identical variants demand priority-setting to preserve design intent and commit to a refinement direction.
• Repeated paired evaluation protocol: We introduce a repeated paired evaluation protocol (100 repeated runs per pair) to test whether AI assessments remain decision-reliable under near-equivalence rather than merely plausible in a single run.
• Core finding: Across repeated paired comparisons of near-identical images, VLMs produce stable evaluations yet fail to converge on a consistent judgment (a directional, action-guiding preference), whereas professional designers reliably make such judgments and justify them by prioritizing decisive concerns that establish a clear next-step design direction.
• Why it matters (AIGC + innovation workflows): In AIGC-heavy innovation workflows with many plausible variants, progress toward design excellence depends on sensitivity to subtle distinctions and the evaluative ability to decide what matters most now.
• Implication for AI-supported tools: Current VLM evaluations can reproduce evaluative language and criteria, but they do not reliably translate those assessments into a stable, action-guiding direction under near-equivalence, limiting their usefulness for late-stage selection and iteration.

Comparing visually near-identical design alternatives is a recurring challenge in late-stage refinement, where progress often depends on subtle distinctions in perceived quality, coherence, or intent. As AI-generated imagery accelerates the production of high-fidelity variants, this raises a practice-facing question: can vision–language models (VLMs) support the kind of comparative judgment designers rely on to choose a direction? We examine VLM evaluation under near-equivalence using repeated paired comparison. Three open-source VLMs assessed eight near-identical design pairs over 100 repeated runs per pair. We analyze rating stability, paired score differences and preference consistency across runs, and cross-model convergence. A complementary study with 13 professional designers provides contrast in how experts establish direction under the same conditions. Results show that while models apply evaluative criteria consistently, their comparative outcomes converge toward equivalence under repetition and do not yield a stable direction for selection or next-step change. Human experts, by contrast, consistently make directional judgments by prioritizing decisive concerns and articulating actionable refinement rationales. The study distinguishes fluent evaluation from decision-relevant judgment and clarifies a boundary for current AI-supported design evaluation in contexts where iteration and commitment are central to design practice.
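To make the distinction between stable ratings and a stable direction concrete, the sketch below summarizes repeated runs for a single design pair. It is only an illustration: the abstract does not give the paper's metric definitions, so the function name `summarize_pair`, the `tie_margin` parameter, and the specific formulas (population standard deviation for stability, majority share for preference consistency) are assumptions rather than the authors' method.

```python
from statistics import mean, pstdev


def summarize_pair(scores_a, scores_b, tie_margin=0.0):
    """Summarize repeated runs for one design pair.

    Hypothetical metrics, not the paper's definitions:
    scores_a[i] and scores_b[i] are the ratings that run i gave
    to variant A and variant B of the same pair.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need equal-length, non-empty score lists")

    # Paired score difference per run (positive means A was rated higher).
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    # Count runs that prefer each variant; differences within
    # tie_margin of zero are treated as "no preference".
    prefers_a = sum(d > tie_margin for d in diffs)
    prefers_b = sum(d < -tie_margin for d in diffs)
    ties = len(diffs) - prefers_a - prefers_b

    return {
        # Low spread = stable ratings for each variant.
        "stability_a": pstdev(scores_a),
        "stability_b": pstdev(scores_b),
        # Near zero = outcomes converge toward equivalence.
        "mean_diff": mean(diffs),
        # Share of runs agreeing with the more frequent preference.
        "consistency": max(prefers_a, prefers_b) / len(diffs),
        "tie_rate": ties / len(diffs),
    }


if __name__ == "__main__":
    # Made-up toy scores for four runs, purely for illustration.
    print(summarize_pair([7, 7, 8, 7], [7, 8, 7, 7]))
```

In this toy input each variant's ratings barely move (low spread), yet the runs split evenly between preferring A, preferring B, and ties, so the mean difference is zero and no direction emerges. That is the pattern the abstract describes as stability without sensitivity.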
