Diffusion models have achieved groundbreaking progress in image generation, text-to-image, and other multimodal generation tasks, becoming the mainstream architecture in the field of generative artificial intelligence. However, studies have shown that diffusion models are vulnerable to backdoor attacks. By injecting specific triggers into the training data, attackers can manipulate the model to generate preset target images during the inference phase, posing a serious security threat. Existing defense methods suffer from three major limitations: detection methods typically rely on prior knowledge of specific attack types or require large amounts of real data; removal methods lack theoretical modeling of the intrinsic mechanism of backdoor injection; and there is no unified, low-data-dependency defense framework. To address the above issues, this paper proposes a unified defense framework named DIFFDEFEND. For the first time, it summarizes the essence of backdoor injection as “layer-by-layer propagation of distribution shifts” and designs a complete solution that achieves high-precision detection and effective removal without requiring real data. Specifically, this paper first proposes a multi-stage joint trigger inversion method that exploits the consistency constraints of distribution shifts across multiple time steps to achieve stable recovery of the trigger. Second, it constructs a dual-modal detector that combines the uniformity score of generated images with total variation loss to achieve high-precision identification of backdoored models. Finally, it designs a distribution-guided purification mechanism that freezes a clean reference model and optimizes the removal loss and retention loss, rapidly eliminating backdoor effects without relying on real data while preserving the model’s generation quality. Extensive experiments on three mainstream architectures—DDPM, NCSN, and LDM—and 13 different samplers demonstrate that DIFFDEFEND achieves near-100% detection accuracy, reduces the backdoor attack success rate to nearly 0, and keeps the model’s generation quality essentially unchanged, significantly outperforming existing methods.
Yang et al. (Tue,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: