August 12, 2024. Open Access.

A Transformer-based multi-modal fusion network for semantic segmentation of high-resolution remote sensing imagery

Abstract

Semantic segmentation of high-resolution multispectral remote sensing imagery has been intensely studied. However, shadow occlusions and similar colors and textures between categories reduce segmentation accuracy. In addition, target sizes in remote sensing images vary widely, and a network struggles to segment large and small targets equally well. This paper introduces the Transformer-based Multi-modal Fusion Network (TMFNet), which fuses multi-modal features and incorporates height features from the digital surface model (DSM) to supply additional cues that distinguish the categories. Specifically, we introduce two parallel encoders to extract features from the different modalities, a Multi-Modal fusion model based on the Transformer (MMformer) to perform the multi-modal fusion, and a Border Region Attention based multi-level Fusion Module (BRAFM) to integrate cross-level features and improve small-target segmentation by exploiting details around the border. Experimental results on the ISPRS Vaihingen and Potsdam benchmark datasets indicate that the proposed TMFNet outperforms state-of-the-art (SOTA) methods in segmentation performance.
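
The abstract does not specify how MMformer is built, so the following is only a generic illustration of Transformer-style fusion between two modalities, not the authors' module. It is a minimal sketch assuming single-head cross-attention in which queries come from image tokens and keys and values come from DSM tokens, with learned projections omitted. The names cross_attention_fuse, image_tokens, and dsm_tokens are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_fuse(image_tokens, dsm_tokens):
    """Fuse two token sequences with single-head cross-attention.

    Queries come from the image tokens; keys and values come from the
    DSM tokens. Learned query/key/value projections are omitted
    (identity), which is an assumption made to keep the sketch short.
    Both inputs are lists of equal-length feature vectors.
    """
    dim = len(image_tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in image_tokens:
        # Scaled dot-product score of this image token against every DSM token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / scale
            for key in dsm_tokens
        ]
        weights = softmax(scores)
        # Attention-weighted average of the DSM tokens (used as values).
        attended = [
            sum(w * value[d] for w, value in zip(weights, dsm_tokens))
            for d in range(dim)
        ]
        # Residual connection keeps the original image feature.
        fused.append([q + a for q, a in zip(query, attended)])
    return fused
```

In this sketch each image token is enriched with a height summary drawn from the DSM tokens it attends to most strongly. A practical model would add learned projections, multiple heads, and normalization.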
