April 24, 2026 | Open Access

Semantic-Guided Multi-Level Collaborative Fusion Network for Visible and Infrared Images

Key Points

The study aims to improve image fusion quality, and with it downstream task performance, by addressing the semantic deficiency of fused representations.
Developed a semantic-guided multi-level collaborative fusion network called DSIFuse.
Leveraged semantic priors and global context from segmentation branches to refine cross-modal features.
Implemented a multi-scale decoder with a semantic compensation block to enhance overall image representations.
DSIFuse generated clear fusion images with improved structural consistency.
Demonstrated reduced artifacts compared to existing methods.
Achieved enhanced performance in downstream object detection tasks using the fused representations.

Abstract

The paramount value of image fusion is manifested in effectively enhancing downstream tasks. However, compatibility with subsequent tasks is compromised due to the semantic deficiency of fusion representations generated by current approaches. To mitigate this limitation, a semantic-guided multi-level collaborative fusion network is proposed, termed DSIFuse. By leveraging semantic priors and global context extracted from auxiliary segmentation branches, a multi-level interaction space is constructed to explicitly refine cross-modal features. Specifically, a cross-modal feature correction mechanism is designed to enhance semantic alignment by injecting complementary visible–infrared information at each layer, while a three-level interaction strategy gradually integrates unimodal features and semantic maps to generate semantically enriched representations. To mitigate semantic information loss during image reconstruction, a semantic compensation block is employed, incorporating interactive representations from prior layers and global semantic maps into the multi-scale decoder. Finally, the overall loss integrates semantic supervision, gradient, and intensity loss. Experiments conducted on public datasets indicate that clear fusion images are generated by DSIFuse, with improved structural consistency and reduced artifacts. Under a unified benchmark, the fused representations subsequently yield improved performance in downstream object detection tasks.
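The abstract states that the overall loss combines semantic supervision, a gradient term, and an intensity term, but gives no formulas. The sketch below shows one way such a composite loss could be assembled. It makes several assumptions that do not come from the paper: the intensity and gradient targets are the element-wise maximum of the two source images (a common choice in infrared-visible fusion work), the semantic term is pixel-wise cross-entropy on segmentation logits, and the weights `w_sem`, `w_grad`, and `w_int` and all function names are hypothetical. It should not be read as the DSIFuse implementation.

```python
import numpy as np


def sobel_magnitude(img):
    """Approximate gradient magnitude |Gx| + |Gy| using 3x3 Sobel kernels."""
    p = np.pad(img, 1, mode="edge")  # replicate borders so output stays H x W
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.abs(gx) + np.abs(gy)


def intensity_loss(fused, vis, ir):
    """Assumed form: L1 distance to the brighter of the two sources."""
    return np.mean(np.abs(fused - np.maximum(vis, ir)))


def gradient_loss(fused, vis, ir):
    """Assumed form: L1 distance to the stronger of the two source gradients."""
    target = np.maximum(sobel_magnitude(vis), sobel_magnitude(ir))
    return np.mean(np.abs(sobel_magnitude(fused) - target))


def semantic_loss(logits, labels):
    """Pixel-wise cross-entropy. logits: (C, H, W); labels: (H, W) class ids."""
    z = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    log_probs = z - np.log(np.exp(z).sum(axis=0, keepdims=True))
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    return -log_probs[labels, rows, cols].mean()


def total_loss(fused, vis, ir, logits, labels,
               w_sem=1.0, w_grad=1.0, w_int=1.0):
    """Weighted sum of the three terms; the weights are illustrative only."""
    return (w_sem * semantic_loss(logits, labels)
            + w_grad * gradient_loss(fused, vis, ir)
            + w_int * intensity_loss(fused, vis, ir))
```

In a training setup, `fused` would be the decoder output and `logits` the output of a segmentation branch. The relative weights control how strongly semantic supervision shapes the fused image compared with the pixel-level terms.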
