What question did this study set out to answer?

To improve water surface object detection using a multimodal RGB-IR sensing framework that adapts vision foundation models.

April 19, 2026 | Open Access

WS-R-IR Adapter: A Multimodal RGB–Infrared Remote Sensing Framework for Water Surface Object Detection

Key Points

Developed WS-R-IR Adapter for object detection in water scenes.
Implemented a water scene domain-aware modal adapter and a parallel multi-scale structural perception module.
Introduced an adaptive RGB-IR feature modulation fusion strategy and a resolution-aligned context semantic and structural detail fusion module.
Created the PoLaRIS-DET and ASV-RI-DET datasets for enhanced training and evaluation.
Achieved mAP@0.5:0.95 scores of 74.2% and 50.2% on PoLaRIS-DET and ASV-RI-DET, respectively.
Outperformed existing methods with only 11.9M additional parameters.
Demonstrated that parameter-efficient VFM adaptation is effective for multimodal water surface remote sensing.
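
The mAP@0.5:0.95 metric quoted above averages detection precision over ten intersection-over-union (IoU) thresholds, from 0.50 to 0.95 in steps of 0.05. The following is a minimal sketch of that thresholding step only, not the evaluation code used in the study; the function names and the example boxes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds behind the "0.5:0.95" notation.
IOU_THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred_box, gt_box):
    """Thresholds at which a prediction would count as a true positive."""
    score = iou(pred_box, gt_box)
    return [t for t in IOU_THRESHOLDS if score >= t]


# Hypothetical example boxes (not taken from either dataset).
gt = (0.0, 0.0, 10.0, 10.0)
pred = (0.0, 0.0, 10.0, 8.2)
passed = thresholds_passed(pred, gt)
```

The metric is strict for tiny objects: shifting an 8×8 pixel box by 2 pixels horizontally already lowers its IoU with the original box to 0.6, so the detection fails at every threshold above 0.60.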

Abstract

Water surface object detection in shipborne remote sensing is challenged by unstable wave-induced backgrounds, illumination variations, extreme scale changes with tiny objects, and limited annotations. Multimodal RGB–infrared (RGB–IR) sensing leverages complementary visible and infrared cues to enhance robustness. However, most existing RGB–IR methods rely on backbones pretrained on limited-scale data, which constrains their performance in complex water surface scenes. In this work, we propose the WS-R-IR Adapter, a parameter-efficient vision foundation model (VFM)-based framework for shipborne RGB–IR object detection. Instead of full fine-tuning, it adapts frozen VFM representations via lightweight task-specific designs. The WS-R-IR Adapter includes (1) a water scene domain-aware modal adapter that progressively guides frozen backbone features with evolving semantic cues, (2) a parallel multi-scale structural perception module for fine-grained, scale-sensitive modeling, (3) an adaptive RGB–IR feature modulation fusion strategy, and (4) a resolution-aligned context semantic and structural detail fusion module. Moreover, we introduce an object-guided global-to-local registration framework to address dynamic cross-modal misalignment, and construct modality-aligned PoLaRIS-DET and ASV-RI-DET datasets that cover diverse water surface scenes. On the two datasets, the proposed method achieves mAP@0.5:0.95 scores of 74.2% and 50.2%, respectively, significantly outperforming existing methods with only 11.9M additional parameters. These results demonstrate the effectiveness of parameter-efficient VFM adaptation for multimodal water surface remote sensing.
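
To make the idea of adapting frozen representations concrete, here is a generic residual bottleneck adapter in plain Python. It is a sketch of the general technique only and is not the WS-R-IR Adapter's architecture, which the abstract does not specify at this level; the class name, dimensions, and initialisation are all assumptions chosen for illustration.

```python
import random


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def relu(vec):
    return [max(0.0, v) for v in vec]


class BottleneckAdapter:
    """Hypothetical residual adapter: out = feat + up(relu(down(feat))).

    The backbone feature passed in is treated as frozen; only the small
    down- and up-projection matrices would be trained.
    """

    def __init__(self, dim, bottleneck, seed=0):
        rng = random.Random(seed)
        # Down-projection: bottleneck x dim, small random values.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(bottleneck)]
        # Up-projection: dim x bottleneck, zero-initialised so the adapter
        # starts as an identity map and cannot disturb the frozen features.
        self.up = [[0.0] * bottleneck for _ in range(dim)]

    def __call__(self, frozen_feat):
        delta = matvec(self.up, relu(matvec(self.down, frozen_feat)))
        return [f + d for f, d in zip(frozen_feat, delta)]

    def num_params(self):
        # Two projection matrices of size dim * bottleneck each (no biases).
        return 2 * len(self.down) * len(self.down[0])


# Hypothetical 8-dimensional feature vector from a frozen backbone.
feat = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, 1.0]
adapter = BottleneckAdapter(dim=8, bottleneck=2)
out = adapter(feat)
```

With `dim=8` and `bottleneck=2` the adapter holds 32 weights, half of the 64 in a single full 8×8 layer, and the saving grows as the bottleneck shrinks relative to the feature dimension. This is the general reason adapter-style designs can add few parameters to a large frozen model.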

Cite This Study

Xue et al. studied this question.

synapsesocial.com/papers/69e47250010ef96374d8e71a https://doi.org/10.3390/rs18081220