What question did this study set out to answer?

This research aims to develop a unified framework for detecting and eliminating backdoor attacks in diffusion models while minimizing data dependency.

May 21, 2026 | Open Access

A Unified Framework Based on Distribution Shift Modeling for Revealing and Eliminating Backdoor Attacks in Diffusion Models

Key Points

Proposes a unified defense framework named DIFFDEFEND.
Implements a multi-stage joint trigger inversion method using distribution shifts.
Constructs a dual-modal detector leveraging uniformity score and total variation loss.
Achieves near-100% detection accuracy in identifying backdoored models.
Reduces backdoor attack success rate to nearly 0%.
Preserves the model's generation quality with minimal degradation.

Abstract

Diffusion models have achieved groundbreaking progress in image generation, text-to-image synthesis, and other multimodal generation tasks, becoming the mainstream architecture in generative artificial intelligence. However, studies have shown that diffusion models are vulnerable to backdoor attacks. By injecting specific triggers into the training data, attackers can manipulate the model to generate preset target images during the inference phase, posing a serious security threat. Existing defense methods suffer from three major limitations: detection methods typically rely on prior knowledge of specific attack types or require large amounts of real data; removal methods lack theoretical modeling of the intrinsic mechanism of backdoor injection; and there is no unified, low-data-dependency defense framework. To address these issues, this paper proposes a unified defense framework named DIFFDEFEND. For the first time, it characterizes the essence of backdoor injection as “layer-by-layer propagation of distribution shifts” and designs a complete solution that achieves high-precision detection and effective removal without requiring real data. Specifically, this paper first proposes a multi-stage joint trigger inversion method that exploits the consistency constraints of distribution shifts across multiple time steps to achieve stable recovery of the trigger. Second, it constructs a dual-modal detector that combines the uniformity score of generated images with total variation loss to achieve high-precision identification of backdoored models. Finally, it designs a distribution-guided purification mechanism that freezes a clean reference model and optimizes a removal loss and a retention loss, rapidly eliminating backdoor effects without relying on real data while preserving the model’s generation quality. Extensive experiments on three mainstream architectures (DDPM, NCSN, and LDM) and 13 different samplers demonstrate that DIFFDEFEND achieves near-100% detection accuracy, reduces the backdoor attack success rate to nearly 0%, and keeps the model’s generation quality essentially unchanged, significantly outperforming existing methods.
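
The abstract names total variation loss as one of the two detector signals but does not define it. As a rough illustration only, the sketch below computes the standard anisotropic total variation of a small grayscale image, that is, the sum of absolute differences between horizontally and vertically adjacent pixels. The function name `total_variation` and the list-of-rows image representation are assumptions made for this example; this is not the DIFFDEFEND implementation, and how the paper combines this quantity with the uniformity score is not specified here.

```python
def total_variation(image):
    """Anisotropic total variation of a 2-D grayscale image.

    `image` is a list of equally long rows of numbers (a hypothetical
    representation chosen for this sketch). The result is the sum of
    absolute differences between each pixel and its right and lower
    neighbours.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    tv = 0.0
    for y in range(height):
        for x in range(width):
            if x + 1 < width:   # horizontal neighbour
                tv += abs(image[y][x + 1] - image[y][x])
            if y + 1 < height:  # vertical neighbour
                tv += abs(image[y + 1][x] - image[y][x])
    return tv


# A constant image has no variation; a checkerboard has the maximum
# variation possible for a 2x2 image with values in [0, 1].
flat = [[0.5, 0.5], [0.5, 0.5]]
checker = [[0.0, 1.0], [1.0, 0.0]]
print(total_variation(flat))     # 0.0
print(total_variation(checker))  # 4.0
```

A constant image yields a value of zero, while images with many sharp pixel-to-pixel changes yield large values, which is why total variation is commonly used as a smoothness measure for images.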
