August 15, 2025 | Open Access

Language-Driven Cross-Attention for Visible–Infrared Image Fusion Using CLIP

Key Points

Fused visible and infrared images give robots a more complete view of a scene than either modality alone, which aids navigation and localization, especially in low-light or complex environments.
The method leverages a cross-domain attention mechanism alongside linguistic information from the CLIP model.
Strong performance was demonstrated on benchmark datasets, achieving metrics like SF = 2.1381 and VIF = 45.1842 on LLVIP.
Combining visual and linguistic information yields color fused images with finer visual detail and richer semantic content.
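
For readers unfamiliar with the SF metric quoted in the key points: spatial frequency is commonly defined as the square root of the sum of the squared row and column frequencies, where each frequency is the root-mean-square difference between adjacent pixels. The following is a minimal NumPy sketch of that common definition. The function name `spatial_frequency` is our own, and the paper's exact normalization and intensity scaling are not stated on this page, so its output should not be expected to reproduce the reported values.

```python
import numpy as np


def spatial_frequency(img):
    """Spatial frequency of a 2-D grayscale image (common textbook form).

    Assumption: squared differences are averaged over the number of
    neighbor pairs; some papers divide by the full pixel count instead.
    """
    img = np.asarray(img, dtype=np.float64)
    # Row frequency: differences between horizontally adjacent pixels.
    rf_sq = np.mean((img[:, 1:] - img[:, :-1]) ** 2)
    # Column frequency: differences between vertically adjacent pixels.
    cf_sq = np.mean((img[1:, :] - img[:-1, :]) ** 2)
    return float(np.sqrt(rf_sq + cf_sq))


# A constant image has no local variation; a 0/1 checkerboard changes
# by exactly 1 between every pair of neighbors.
flat = np.full((4, 4), 7.0)
checker = np.indices((4, 4)).sum(axis=0) % 2
sf_flat = spatial_frequency(flat)
sf_checker = spatial_frequency(checker)
```

Higher SF indicates more fine-grained intensity variation, which is why it is used as a proxy for the amount of detail preserved in a fused image.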

Abstract

Language-guided multimodal fusion, which integrates information from both visible and infrared images, has shown strong performance in image fusion tasks. In low-light or complex environments, a single modality often fails to fully capture scene features, whereas fused images enable robots to obtain multidimensional scene understanding for navigation, localization, and environmental perception. This capability is particularly important in applications such as autonomous driving, intelligent surveillance, and search-and-rescue operations, where accurate recognition and efficient decision-making are critical. To enhance the effectiveness of multimodal fusion, we propose a text-guided infrared and visible image fusion network. The framework consists of two key components: an image fusion branch, which employs a cross-domain attention mechanism to merge multimodal features, and a text-guided module, which leverages the CLIP model to extract semantic cues from image descriptions containing visible content. These semantic parameters are then used to guide the feature modulation process during fusion. By integrating visual and linguistic information, our framework is capable of generating high-quality color-fused images that not only enhance visual detail but also enrich semantic understanding. On benchmark datasets, our method achieves strong quantitative performance: SF = 2.1381, Qab/f = 0.6329, MI = 14.2305, SD = 0.8527, VIF = 45.1842 on LLVIP, and SF = 1.3149, Qab/f = 0.5863, MI = 13.9676, SD = 94.7203, VIF = 0.7746 on TNO. These results highlight the robustness and scalability of our model, making it a promising solution for real-world multimodal perception applications.
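
The two components named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' network: it assumes single-head scaled dot-product attention without learned projections for the cross-domain step, and a FiLM-style per-channel scale and shift for the text-guided modulation. The random vector `text_emb` stands in for a CLIP text embedding, and all names, shapes, and weights (`cross_attention`, `text_modulate`, `w_scale`, `w_shift`) are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(query_feats, context_feats):
    """Queries come from one modality, keys and values from the other."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ context_feats, weights


def text_modulate(feats, text_emb, w_scale, w_shift):
    """FiLM-style modulation (assumed form): per-channel scale and shift
    predicted linearly from the text embedding."""
    gamma = text_emb @ w_scale  # shape (channels,)
    beta = text_emb @ w_shift   # shape (channels,)
    return feats * (1.0 + gamma) + beta


rng = np.random.default_rng(0)
n_tokens, channels, text_dim = 16, 8, 4

vis = rng.normal(size=(n_tokens, channels))  # visible-image tokens
ir = rng.normal(size=(n_tokens, channels))   # infrared-image tokens
text_emb = rng.normal(size=(text_dim,))      # stand-in for a CLIP text embedding
w_scale = 0.1 * rng.normal(size=(text_dim, channels))
w_shift = 0.1 * rng.normal(size=(text_dim, channels))

# Each modality attends to the other, with a residual connection.
vis_from_ir, attn = cross_attention(vis, ir)
ir_from_vis, _ = cross_attention(ir, vis)
fused_raw = 0.5 * ((vis + vis_from_ir) + (ir + ir_from_vis))

# The text embedding then rescales and shifts the fused channels.
fused = text_modulate(fused_raw, text_emb, w_scale, w_shift)
```

With a zero text embedding the modulation reduces to the identity, so in this sketch the text branch only adjusts the fused features rather than replacing them.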

Cite This Study

Wang et al. studied this question.

synapsesocial.com/papers/68af509bad7bf08b1ead87b3 https://doi.org/10.3390/s25165083