What question did this study set out to answer?

The research aims to systematically review multimodal image feature fusion technologies and their applications in complex perception systems.

May 27, 2026 | Open Access

A Review of Multimodal Image Feature Fusion Technology and Application

Key Points

Systematically reviews multimodal image feature fusion technologies and their applications in complex perception systems.
Follows the PRISMA guidelines for literature review and data extraction.
Proposes a five-dimensional decoupling classification architecture for analyzing fusion models.
Summarizes qualitative and quantitative evaluation metrics and cross-domain public datasets.
Highlights the linear-complexity efficiency of Mamba architectures in long-sequence modeling.
Shows the high-fidelity reconstruction capabilities of generative diffusion models.
Emphasizes the universal semantic alignment achieved by Vision Foundation Models.

Abstract

Multimodal image fusion has emerged as a core technology for complex perception systems (such as autonomous driving, remote sensing monitoring, and medical diagnosis) by integrating complementary information from heterogeneous sensors. Given the rapid technological evolution within this field, particularly driven by the emergence of Mamba architectures, Generative Diffusion Models, and Vision Foundation Models (VFMs), traditional classification methods no longer fully encompass the ongoing paradigm shifts. Following the PRISMA guidelines to ensure the objectivity and reproducibility of the findings, this paper provides a systematic literature review and data extraction for multimodal image feature fusion. Under this standardized framework, a five-dimensional decoupling classification architecture is proposed to deconstruct models across fusion hierarchy, backbone architecture, fusion operator, supervision paradigm, and deployment constraints. Specifically, the analysis highlights the linear computational efficiency of Mamba in long-sequence modeling, the high-fidelity reconstruction capabilities of diffusion models via generative priors, and the universal semantic alignment achieved by VFMs. Furthermore, this study summarizes qualitative and quantitative evaluation metrics alongside cross-domain public datasets for performance benchmarking while discussing critical future directions, including cross-modal alignment in complex environments, parameter-efficient fine-tuning of large models, and real-time inference at the edge.
