Semantic segmentation of unmanned aerial vehicle (UAV) imagery remains challenging owing to extreme scale variation, cluttered backgrounds, and oblique viewing geometry. Existing approaches suffer from three interrelated limitations: (i) single-attention architectures employ one attention paradigm throughout the decoder, which restricts the network to either global context or fine-grained local detail but not both; (ii) homogeneous multi-branch designs replicate the same attention type across branches, which increases computation without adding representational diversity; and (iii) prevailing feature fusion strategies (element-wise addition and concatenation) lack channel-level adaptivity and therefore fail to exploit the complementary strengths of heterogeneous feature sources. To overcome these limitations, we propose HAFNet (Heterogeneous Attention Fusion Network), which introduces a multi-branch decoder in which four parallel branches, employing multi-head, spatial, self-, and shifted-window attention, respectively, decode shared encoder features concurrently. A squeeze-and-excitation (SE) enhanced aggregation module then adaptively recalibrates and fuses the branch outputs at the channel level, so that the network leverages the complementary strengths of diverse attention mechanisms within a single forward pass. Extensive experiments on four public benchmarks demonstrate that HAFNet establishes new state-of-the-art results, achieving 72.1% mIoU on UAVid, 84.5% on ISPRS Vaihingen, 88.2% on ISPRS Potsdam, and 54.8% on LoveDA, surpassing the latest competing methods, including UrbanSSF-L and UNetFormer. Ablation studies further verify that each branch provides unique and complementary representations: the full four-branch configuration consistently outperforms every subset, with especially pronounced improvements on small-scale objects (+18.3% F1 for cars) and heterogeneous land-cover categories.
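To make the channel-level fusion step concrete, the following is a minimal NumPy sketch of squeeze-and-excitation gating applied to several branch outputs. It is an illustration under stated assumptions, not the HAFNet implementation: the function name `se_fuse`, the weight arguments `w1`, `b1`, `w2`, `b2`, the concatenate-then-gate ordering, and the final summation over branches are all hypothetical choices, since the abstract does not specify these details.

```python
import numpy as np


def se_fuse(branches, w1, b1, w2, b2):
    """Fuse branch feature maps with squeeze-and-excitation channel gating.

    Assumed (hypothetical) layout: each branch is an array of shape (C, H, W),
    and the gates are computed over all B*C stacked channels.
    """
    stacked = np.stack(branches)                # (B, C, H, W)
    b, c, h, w = stacked.shape
    feats = stacked.reshape(b * c, h, w)        # treat every branch channel separately

    # Squeeze: global average pooling gives one descriptor per channel.
    squeezed = feats.mean(axis=(1, 2))          # (B*C,)

    # Excitation: bottleneck MLP with ReLU, then sigmoid gates in (0, 1).
    hidden = np.maximum(w1 @ squeezed + b1, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(w2 @ hidden + b2)))

    # Recalibrate each channel, then sum the branches (assumed fusion rule).
    recalibrated = feats * gates[:, None, None]
    return recalibrated.reshape(b, c, h, w).sum(axis=0)   # (C, H, W)


# Usage with random stand-ins for the four attention-branch outputs.
rng = np.random.default_rng(0)
branch_outputs = [rng.standard_normal((8, 4, 4)) for _ in range(4)]
n_channels, n_hidden = 4 * 8, 8                 # reduction ratio of 4 (assumed)
fused = se_fuse(
    branch_outputs,
    0.1 * rng.standard_normal((n_hidden, n_channels)), np.zeros(n_hidden),
    0.1 * rng.standard_normal((n_channels, n_hidden)), np.zeros(n_channels),
)
print(fused.shape)
```

Because the gates depend on the pooled statistics of every branch, a channel from one attention type can be amplified or suppressed according to what the other branches produce, which is the adaptivity that plain addition or concatenation lacks.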