April 24, 2026 | Open Access

MmSAM: multimodal meets SAM2 for efficient remote sensing semantic segmentation

Abstract

Semantic segmentation plays a crucial role in numerous remote sensing (RS) applications. Despite the success of multimodal RS segmentation models, integrating large-scale visual priors from foundation models with multimodal information remains challenging because of modality incompatibility and increased computational cost. To address this, we propose MmSAM, an efficient fine-tuning framework that applies Segment Anything Model 2 (SAM2) to multimodal RS semantic segmentation. Unlike traditional feature fusion paradigms in multimodal segmentation, we do not treat additional modalities as inputs equal to the main modality but as prompts for it. We employ the Mixture-of-Experts (MoE) mechanism to construct hard- and soft-MoE modules as multimodal prompters, which sparsifies the model architecture while extracting multimodal features and thereby keeps the computational load under control. Additionally, we introduce several fine-tuning methods to enhance the performance of the SAM2 image encoder and perform end-to-end modification to better adapt the model to downstream tasks. Experimental results on two public multimodal RS datasets demonstrate that MmSAM outperforms the single-modal SAM2 baseline by approximately 2.5% and 2.2% in mean intersection over union (mIoU), respectively. Furthermore, MmSAM achieves state-of-the-art performance at lower computational cost, making it well suited to consumer-level deployment. The code will be available at: https://github.com/W-qp/MmSAM.
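For readers unfamiliar with the reported metric, the following is a minimal sketch of how mean intersection over union (mIoU) is commonly computed from flattened per-pixel labels. It is not taken from the MmSAM code; the function names `confusion_matrix` and `mean_iou` are illustrative, and skipping classes that appear in neither the ground truth nor the prediction is an assumed convention that individual benchmarks may handle differently.

```python
from typing import List, Sequence


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int],
                     num_classes: int) -> List[List[int]]:
    """Count pixels per (true class, predicted class) pair."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def mean_iou(y_true: Sequence[int], y_pred: Sequence[int],
             num_classes: int) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN).

    Assumption: classes absent from both labels and predictions
    are skipped rather than counted as IoU 0 or 1.
    """
    cm = confusion_matrix(y_true, y_pred, num_classes)
    ious = []
    for c in range(num_classes):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                                   # true c, predicted otherwise
        fp = sum(cm[r][c] for r in range(num_classes)) - tp    # predicted c, true otherwise
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with three classes and six "pixels".
labels = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
print(mean_iou(labels, preds, num_classes=3))
```

In the toy example the per-class IoU values are 1/3, 2/3, and 1/2, so the mean is 0.5.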
