Water surface object detection in shipborne remote sensing is challenged by unstable wave-induced backgrounds, illumination variations, extreme scale changes with tiny objects, and limited annotations. Multimodal RGB–infrared (RGB–IR) sensing leverages complementary visible and infrared cues to enhance robustness. However, most existing RGB–IR methods rely on backbones pretrained on limited-scale data, which constrains their performance in complex water surface scenes. In this work, we propose the WS-R-IR Adapter, a parameter-efficient framework for shipborne RGB–IR object detection built on a vision foundation model (VFM). Instead of fully fine-tuning the VFM, it adapts frozen VFM representations through lightweight task-specific modules. The WS-R-IR Adapter includes (1) a water scene domain-aware modal adapter that progressively guides frozen backbone features with evolving semantic cues, (2) a parallel multi-scale structural perception module for fine-grained, scale-sensitive modeling, (3) an adaptive RGB–IR feature modulation fusion strategy, and (4) a resolution-aligned fusion module that combines contextual semantics with structural detail. Moreover, we introduce an object-guided global-to-local registration framework to address dynamic cross-modal misalignment, and we construct two modality-aligned datasets, PoLaRIS-DET and ASV-RI-DET, that cover diverse water surface scenes. On PoLaRIS-DET and ASV-RI-DET, the proposed method achieves mAP@0.5:0.95 scores of 74.2% and 50.2%, respectively, significantly outperforming existing methods while adding only 11.9M parameters. These results demonstrate the effectiveness of parameter-efficient VFM adaptation for multimodal water surface remote sensing.
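To make the idea of adapting frozen representations concrete, the following is a minimal sketch of a generic residual bottleneck adapter applied to a fixed feature vector. It is not the WS-R-IR Adapter itself: the class name `BottleneckAdapter`, the dimensions, the ReLU nonlinearity, and the zero initialization of the up-projection are all assumptions chosen for illustration, following a common adapter pattern in which only the small down- and up-projections would be trained.

```python
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


class BottleneckAdapter:
    """Generic residual bottleneck adapter (hypothetical, for illustration only)."""

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Down-projection (rank x dim): small random initialization.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(rank)]
        # Up-projection (dim x rank): zeros, so the adapter starts as an identity map.
        self.up = [[0.0] * rank for _ in range(dim)]

    def __call__(self, feat):
        hidden = [max(0.0, h) for h in matvec(self.down, feat)]  # ReLU bottleneck
        delta = matvec(self.up, hidden)
        # Residual connection: the frozen feature passes through unchanged plus a learned offset.
        return [f + d for f, d in zip(feat, delta)]

    def num_params(self):
        """Count the weights in both projections (biases omitted in this sketch)."""
        return len(self.down) * len(self.down[0]) + len(self.up) * len(self.up[0])


# A stand-in for one feature vector produced by a frozen backbone.
frozen_feature = [0.5, -1.0, 2.0, 0.25]
adapter = BottleneckAdapter(dim=4, rank=2)
adapted = adapter(frozen_feature)
```

With a zero-initialized up-projection, the adapter initially returns the frozen feature unchanged, so training starts from the pretrained behavior. The added parameter count grows as `2 * dim * rank`, which is why such modules stay small relative to the backbone they adapt.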
Xue et al. studied this question.
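The reported metric, mAP@0.5:0.95, is conventionally read as average precision averaged over ten intersection-over-union (IoU) thresholds from 0.50 to 0.95 in steps of 0.05, and then over classes. The sketch below assumes that convention applies here and shows only the IoU computation and the threshold sweep, not a full AP calculation; the helper names `iou` and `thresholds_passed` and the example boxes are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Ten thresholds: 0.50, 0.55, ..., 0.95 (built from integers to avoid float drift).
IOU_THRESHOLDS = [t / 100 for t in range(50, 100, 5)]


def thresholds_passed(pred, gt):
    """Number of IoU thresholds at which this prediction counts as a match."""
    overlap = iou(pred, gt)
    return sum(1 for t in IOU_THRESHOLDS if overlap >= t)


gt_box = (0, 0, 10, 10)
pred_box = (0, 0, 10, 8)  # covers 80% of the ground-truth box
passed = thresholds_passed(pred_box, gt_box)
```

In this example the prediction has an IoU of 0.8 and therefore counts as a match at seven of the ten thresholds. Because the stricter thresholds demand tight localization, the metric is especially demanding for tiny objects, where a shift of a few pixels can change the IoU substantially.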