Multimodal image fusion has emerged as a core technology for complex perception systems, such as autonomous driving, remote sensing monitoring, and medical diagnosis, because it integrates complementary information from heterogeneous sensors. The field is evolving rapidly, driven in particular by the emergence of Mamba architectures, generative diffusion models, and Vision Foundation Models (VFMs), and traditional classification schemes no longer fully capture the ongoing paradigm shifts. Following the PRISMA guidelines to ensure objectivity and reproducibility, this paper presents a systematic literature review and data extraction for multimodal image feature fusion. Within this standardized framework, a five-dimensional decoupled classification architecture is proposed that deconstructs models along fusion hierarchy, backbone architecture, fusion operator, supervision paradigm, and deployment constraints. Specifically, the analysis highlights the linear computational complexity of Mamba in long-sequence modeling, the high-fidelity reconstruction capabilities of diffusion models via generative priors, and the universal semantic alignment achieved by VFMs. Furthermore, this study summarizes qualitative and quantitative evaluation metrics alongside cross-domain public datasets for performance benchmarking, and discusses critical future directions, including cross-modal alignment in complex environments, parameter-efficient fine-tuning of large models, and real-time inference at the edge.
Cao et al. (Mon,) studied this question.