Accurate semantic segmentation of remote sensing imagery requires both fine-grained boundary modelling and long-range contextual reasoning. To address this challenge, we propose TransDeepUNet, a hierarchical multimodal fusion network that integrates RGB imagery and DSM elevation data. The framework employs a parameter-sharing dual-branch encoder to preserve modality-specific representations. A shallow cross-modal attention module enhances structural details, while a deep cross-modal Transformer models global dependencies and semantic alignment. A cascaded decoder progressively reconstructs high-resolution segmentation maps. Experiments on the ISPRS Vaihingen and Potsdam datasets and a high-resolution Swiss dataset demonstrate consistent performance improvements over strong CNN and hybrid baselines. On the Potsdam dataset, TransDeepUNet achieves an mIoU of 85.64% and an mF1-score of 92.07%, outperforming comparable multimodal models while maintaining competitive computational complexity. The code is publicly available at: https://github.com/yingning01/TransDeepUNet.
Wang et al. (Sun,) studied this question.