What type of study is this?

This is a quantitative study.

October 10, 2025 · Open Access

Multimodal Image Segmentation with Dynamic Adaptive Window and Cross-Scale Fusion for Heterogeneous Data Environments

Key Points

DyFuseNet achieved state-of-the-art segmentation performance, with mIoU scores of 80.20% and 80.65% on the ISPRS Vaihingen and Potsdam benchmark datasets.
The Dynamic Window Module adapts window sizes to improve the representation of irregular and fine-grained objects, raising accuracy in complex scenes.
The model remained robust in challenging scenes such as building edges and shadowed objects, achieving an average F1 score of 85% at 26.19 GFLOPs and 30.09 FPS.
DyFuseNet offers a versatile solution for multi-modal image analysis and highlights the importance of cross-scale feature fusion in practical applications.

Abstract

Multi-modal image segmentation is a key task in fields such as urban planning, infrastructure monitoring, and environmental analysis. It remains challenging because of complex scenes, varying object scales, and the need to integrate heterogeneous data sources such as RGB, depth maps, and infrared. To address these challenges, we propose DyFuseNet, a multi-modal segmentation framework built on dynamic adaptive windows and cross-scale feature fusion. The framework has three key components: (1) a Dynamic Window Module (DWM), which uses dynamic partitioning and continuous position bias to adjust window sizes adaptively, improving the representation of irregular and fine-grained objects; (2) Scale Context Attention (SCA), a hierarchical mechanism that links local details with global semantics in a coarse-to-fine manner, improving segmentation accuracy in low-texture or occluded regions; and (3) a Hierarchical Adaptive Fusion Architecture (HAFA), which aligns and fuses features from multiple modalities through shallow synchronization and deep channel attention, balancing complementarity against redundancy. Evaluated on the ISPRS Vaihingen and Potsdam benchmark datasets, DyFuseNet achieved state-of-the-art performance, with mean Intersection over Union (mIoU) scores of 80.20% and 80.65%, surpassing MFTransNet by 1.71% and 1.57%, respectively. The model was also robust in challenging scenes such as building edges and shadowed objects, achieving an average F1 score of 85% while remaining efficient (26.19 GFLOPs, 30.09 FPS), which makes it suitable for real-time deployment. This work presents a practical, versatile, and computationally efficient solution for multi-modal image analysis, with potential applications beyond remote sensing, including smart monitoring, industrial inspection, and multi-source data fusion tasks.
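For readers unfamiliar with the reported metrics, the sketch below shows how mIoU and a class-averaged F1 score are commonly computed from a confusion matrix over pixel labels. It is a minimal illustration, not the authors' evaluation code: the function names (`confusion_matrix`, `per_class_scores`, `mean`), the toy labels, and the choice to skip classes absent from both reference and prediction are all assumptions made here.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels; rows index the reference class, columns the predicted class."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_scores(cm: List[List[int]]) -> Tuple[List[float], List[float]]:
    """Return (IoU list, F1 list) for every class that occurs at least once.

    Assumption: classes with no reference and no predicted pixels are skipped
    rather than counted as 0 or 1.
    """
    n = len(cm)
    ious, f1s = [], []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # reference c, predicted other
        fp = sum(cm[r][c] for r in range(n)) - tp   # predicted c, reference other
        if tp + fp + fn == 0:
            continue
        ious.append(tp / (tp + fp + fn))            # IoU = TP / (TP + FP + FN)
        f1s.append(2 * tp / (2 * tp + fp + fn))     # F1 = 2TP / (2TP + FP + FN)
    return ious, f1s


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy flattened label maps with three classes (hypothetical data).
    y_true = [0, 0, 1, 1, 2, 2]
    y_pred = [0, 1, 1, 1, 2, 0]
    ious, f1s = per_class_scores(confusion_matrix(y_true, y_pred, 3))
    print(f"mIoU = {mean(ious):.4f}, mean F1 = {mean(f1s):.4f}")
```

Because both scores are built from the same true-positive, false-positive, and false-negative counts, per-class F1 is never lower than per-class IoU, which is consistent with the abstract reporting an average F1 (85%) above the mIoU values.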
