Text-to-image diffusion models such as Stable Diffusion enable high-quality image synthesis from text and are widely deployed due to their open-source nature and low computational requirements. However, this accessibility also makes them attractive targets for misuse, including the generation of not-safe-for-work and otherwise restricted content. In this paper, we propose EvilPrompt, a jailbreak attack that exploits the cross-attention mechanism in Stable Diffusion. The attack operates purely at inference time using plain-text prompts and does not require fine-tuning or modification of model parameters. By selectively reweighting cross-attention for specific tokens, EvilPrompt preserves the overall semantic structure of the prompt while steering the generation toward prohibited content. This enables fine-grained control over malicious semantics without introducing explicit unsafe keywords. We evaluate EvilPrompt on two real-world prompt sets, 4chan and Lexica, each containing 500 prompts. The attack achieves an Attack Success Rate (ASR) of 97.4% on 4chan and 98.0% on Lexica, yielding an overall average ASR of 97.7%. The attack maintains high semantic alignment between prompts and generated images. Bootstrapping Language-Image Pre-training (BLIP) similarity consistently exceeds 0.75 across all categories on both datasets. Human evaluation further confirms high visual realism, with mean scores above 7.0 on a 10-point scale, and strong semantic consistency, with mean scores above 7.3. These results demonstrate that cross-attention manipulation provides an effective and practical jailbreak pathway. We further analyze how commonly used text-level moderation affects the success of such attacks. Although the strongest defense configuration (HateCoT with GPT-4) reduces the ASR to 5.9%, it introduces 21.5 s of additional latency and a cost of 0.01182 per query. Lighter-weight alternatives such as Perspective API leave nearly half (45.0%) of attacks successful.
These observations indicate that safeguards acting only on the input or final output are insufficient to capture attention-level manipulations. Overall, our results reveal a fundamental limitation of post-generation safety pipelines when confronted with inference-time control of cross-attention.
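As a quick check on the reported figures, the sketch below recomputes the overall ASR from the two per-dataset rates. It assumes that each rate is taken over all 500 prompts in its set and that the overall figure pools both sets; the function name `attack_success_rate` and the derived success counts are illustrative and do not appear in the paper.

```python
def attack_success_rate(successes: int, total: int) -> float:
    """Return the attack success rate as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Reported per-dataset ASR values (percent) and the stated set size.
PROMPTS_PER_SET = 500
reported = {"4chan": 97.4, "Lexica": 98.0}

# Assumption: each rate is computed over all 500 prompts, so the
# implied number of successful attacks is rate * 500 / 100.
successes = {
    name: round(rate / 100 * PROMPTS_PER_SET)
    for name, rate in reported.items()
}

# Pool both sets to obtain the overall ASR.
overall = attack_success_rate(
    sum(successes.values()),
    PROMPTS_PER_SET * len(reported),
)

print(successes)
print(f"overall ASR: {overall:.1f}%")
```

Because the two sets are the same size, the pooled rate coincides with the simple mean of the two per-dataset rates, which is consistent with the 97.7% average stated above.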
Zhuang et al. (Thu,) studied this question.