May 9, 2026 | Open Access

Unveiling the Risk of Unsafe Image Generation in Stable Diffusion Through a Cross-Attention Mechanism

Key Points

The study examines the risk of unsafe image generation in Stable Diffusion that arises from manipulating the cross-attention mechanism.
It proposes EvilPrompt, a jailbreak attack that reweights cross-attention for specific tokens at inference time, without fine-tuning or modifying model parameters.
The attack is evaluated on two real-world prompt sets, 4chan and Lexica, each containing 500 prompts.
The authors analyze how commonly used text-level moderation techniques affect attack success.
The attack achieves an average Attack Success Rate (ASR) of 97.7% (97.4% on 4chan and 98.0% on Lexica).
Semantic alignment remains high, with BLIP similarity exceeding 0.75 across all categories on both datasets.
The strongest defense configuration (HateCoT with GPT-4) reduces the ASR to 5.9% but adds 21.5 s of latency per query.

Abstract

Text-to-image diffusion models such as Stable Diffusion enable high-quality image synthesis from text and are widely deployed due to their open-source nature and low computational requirements. However, this accessibility also makes them attractive targets for misuse, including the generation of not-safe-for-work and otherwise restricted content. In this paper, we propose EvilPrompt, a jailbreak attack that exploits the cross-attention mechanism in Stable Diffusion. The attack operates purely at inference time using plain-text prompts and does not require fine-tuning or modification of model parameters. By selectively reweighting cross-attention for specific tokens, EvilPrompt preserves the overall semantic structure of the prompt while steering the generation toward prohibited content. This enables fine-grained control over malicious semantics without introducing explicit unsafe keywords. We evaluate EvilPrompt on two real-world prompt sets, 4chan and Lexica, each containing 500 prompts. The attack achieves an Attack Success Rate (ASR) of 97.4% on 4chan and 98.0% on Lexica, yielding an overall average ASR of 97.7%. The attack maintains high semantic alignment between prompts and generated images. Bootstrapping Language-Image Pre-training (BLIP) similarity consistently exceeds 0.75 across all categories on both datasets. Human evaluation further confirms high visual realism, with mean scores above 7.0 on a 10-point scale, and strong semantic consistency, with mean scores above 7.3. These results demonstrate that cross-attention manipulation provides an effective and practical jailbreak pathway. We further analyze how commonly used text-level moderation affects the success of such attacks. Although the strongest defense configuration (HateCoT with GPT-4) reduces the ASR to 5.9%, it introduces 21.5 s of additional latency and a cost of 0.01182 per query. Lighter-weight alternatives such as Perspective API leave nearly half (45.0%) of attacks successful. These observations indicate that safeguards acting only on the input or final output are insufficient to capture attention-level manipulations. Overall, our results reveal a fundamental limitation of post-generation safety pipelines when confronted with inference-time control of cross-attention.
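As a reading aid, the reported ASR figures can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code. The success counts (487 and 490) are assumptions back-computed from the reported percentages and the 500-prompt set sizes, and the function name attack_success_rate is hypothetical.

```python
def attack_success_rate(successes: int, total: int) -> float:
    """Return the attack success rate (ASR) as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * successes / total


# Assumed counts, back-computed from the reported 97.4% and 98.0%
# on 500 prompts each; the abstract does not state raw counts.
counts = {"4chan": (487, 500), "Lexica": (490, 500)}

# Per-dataset ASR.
asr = {name: attack_success_rate(s, n) for name, (s, n) in counts.items()}

# Unweighted mean of the per-dataset rates.
macro_avg = sum(asr.values()) / len(asr)

# Pooled rate over all 1,000 prompts.
pooled = attack_success_rate(
    sum(s for s, _ in counts.values()),
    sum(n for _, n in counts.values()),
)

for name, value in asr.items():
    print(f"{name}: {value:.1f}%")
print(f"macro average: {macro_avg:.1f}%")
print(f"pooled: {pooled:.1f}%")
```

Because both prompt sets have the same size, the unweighted mean and the pooled rate coincide at 97.7%. With unequal set sizes the two would generally differ.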
