September 16, 2025

Mask-DiFuser: A Masked Diffusion Model for Unified Unsupervised Image Fusion

Key Points

Mask-DiFuser generates fused images through a masked diffusion model, achieving better alignment with human perception.
Experiments on infrared-visible, medical, multi-exposure, and multi-focus image fusion show that Mask-DiFuser significantly outperforms existing state-of-the-art methods.
The method employs a content encoder and a semantic encoder to effectively integrate local and global contextual information.
A dual masking scheme simulates complementary information, recasting unsupervised fusion as a masked image reconstruction task and sidestepping the absence of ground truth.

Abstract

The absence of ground truth (GT) in most fusion tasks poses significant challenges for model optimization, evaluation, and generalization. Existing fusion methods that aggregate complementary context rely predominantly on hand-crafted fusion rules and sophisticated loss functions, which introduce subjectivity and often fail to adapt to complex real-world scenarios. To address this challenge, we propose Mask-DiFuser, a novel fusion paradigm that transforms the unsupervised image fusion task into a dual masked image reconstruction task by combining masked image modeling with a diffusion model, overcoming various issues arising from the absence of GT. In particular, we devise a dual masking scheme to simulate complementary information and employ a diffusion model to restore source images from two masked inputs, thereby aggregating complementary contexts. A content encoder with an attention parallel feature mixer is deployed to extract and integrate complementary features, offering local content guidance. Moreover, a semantic encoder is developed to supply global context, which is integrated into the diffusion model via a cross-attention mechanism. During inference, Mask-DiFuser begins with a Gaussian distribution and iteratively denoises it conditioned on multi-source images to directly generate fused images. The masked diffusion model, learning priors from high-quality natural images, ensures that fusion results align more closely with human visual perception. Extensive experiments on several fusion tasks, including infrared-visible, medical, multi-exposure, and multi-focus image fusion, demonstrate that Mask-DiFuser significantly outperforms state-of-the-art (SOTA) fusion methods.
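
As an illustration of the dual masking idea only, the sketch below builds two patch-level binary masks over a single image so that each masked copy hides what the other keeps. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the mask shape, patch size, masking ratio, or whether the two masks are exactly complementary, so the function names (`dual_patch_masks`, `apply_mask`) and all parameters here are hypothetical.

```python
import random


def dual_patch_masks(height, width, patch, ratio=0.5, seed=0):
    """Return two binary masks (1 = keep pixel, 0 = hide pixel) as nested lists.

    Assumption: the masks are exactly complementary at patch level, which is
    one plausible reading of "dual masking", not a detail given in the abstract.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be multiples of patch")
    rng = random.Random(seed)
    rows, cols = height // patch, width // patch
    cells = [(r, c) for r in range(rows) for c in range(cols)]
    rng.shuffle(cells)
    # Patches kept in the first masked input; the rest go to the second one.
    kept_by_a = set(cells[: int(len(cells) * ratio)])
    mask_a = [
        [1 if (y // patch, x // patch) in kept_by_a else 0 for x in range(width)]
        for y in range(height)
    ]
    mask_b = [[1 - value for value in row] for row in mask_a]
    return mask_a, mask_b


def apply_mask(image, mask):
    """Zero out the hidden pixels of a single-channel image (nested lists)."""
    return [
        [pixel * keep for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Toy 8x8 "image" standing in for a high-quality natural training image.
image = [[float(y * 8 + x) for x in range(8)] for y in range(8)]
mask_a, mask_b = dual_patch_masks(8, 8, patch=2, ratio=0.5)
input_a = apply_mask(image, mask_a)
input_b = apply_mask(image, mask_b)

# Because the masks are complementary, the two inputs jointly cover the image.
recombined = [
    [a + b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(input_a, input_b)
]
```

In this toy setting the original image serves as the reconstruction target for the pair `input_a`, `input_b`, which mirrors how masked reconstruction can supply a training signal where no fused ground truth exists. The diffusion model, content encoder, and semantic encoder are not sketched here.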
