Text-to-image (T2I) models have garnered attention for their expressive capabilities, highlighting the need for robust evaluation metrics. Text-image alignment, a critical aspect of evaluation, has recently received broad attention, leading to various evaluation methods and benchmarks. Among T2I models, Stable Diffusion models have become widely used for generating images from arbitrary prompts; however, their performance varies with prompt difficulty, which prior studies have not comprehensively addressed. To bridge this gap, we propose a template-based evaluation method for controllable difficulty (TEM-CD). Our approach defines prompt difficulty levels and integrates attention maps with a vision-language model to enable precise evaluation across diverse image classes, attributes, and styles. We validated the method through two experiments. First, a user study confirmed that its results aligned more closely with human assessments than those of conventional metrics. Second, we applied it to three Stable Diffusion models (SD1.4, SD1.5, and SD2). The results showed that SD2 achieved superior expressiveness, particularly in attribute generation. Despite overall similarities among the three models, our method identified differences in their generative capability. These findings highlight the robustness and reliability of the proposed approach and demonstrate its effectiveness as an evaluation method for Stable Diffusion models.
Fusa et al. (Thu,) studied this question.