Monocular height estimation from remote sensing images plays a crucial role in urban planning, 3D reconstruction, and environmental monitoring. However, existing monocular height estimation networks rely primarily on implicit semantic features in remote sensing images and neglect the geometric structure of ground objects. This limitation weakens edge preservation and the physical plausibility of estimated ground objects in complex scenes. To address this issue, we propose a Geometry-Aware Dense Prediction Network for Monocular Height Estimation from Remote Sensing Images (GA-DPNet). First, with DINOv2 as the encoder, a Multi-Scale Geometric Alignment (MSGA) module reduces the inconsistency in global geometric space among the features that DINOv2 extracts at different levels. Second, a Geometry-Aware Feature Fusion Block (GAFFB) combines a Geometric Feature Extractor (GFE), a Geometry-Aware Attention (GAA) module, and a Geometric Modulated Residual (GMR) module. GAFFB extracts four geometric features (gradient, curvature, planarity, and edges) and uses them to modulate attention weights, which makes the fusion of multi-scale geometric information with semantic features more effective during decoding. Finally, a Multi-component Geometric Constraint Loss (MGCL), comprising a geometric consistency loss and a physical constraint loss, enhances the geometric plausibility of the network’s predictions. Experiments on three public datasets (Vaihingen, Potsdam, and DFC2019) show that GA-DPNet achieves mean absolute errors (MAEs) of 0.783, 1.022, and 0.835, respectively, demonstrating superior height estimation accuracy and edge preservation.
Yang et al. (Mon,) studied this question.
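The abstract names four geometric cues (gradient, curvature, planarity, and edges) without giving their formulas. As a rough illustration only, the sketch below derives comparable cues from a 2D height map with finite differences in NumPy. The function name `geometric_features`, the parameter `edge_fraction`, the use of the Laplacian as a curvature proxy, and the Hessian-norm planarity score are all assumptions made for this example; they are not taken from GA-DPNet's GFE module, which may define these quantities differently.

```python
import numpy as np


def geometric_features(height, edge_fraction=0.5):
    """Hypothetical helper: simple geometric cues from a 2D height map.

    Returns (gradient magnitude, curvature proxy, planarity score, edge mask).
    All definitions here are illustrative assumptions, not the paper's GFE.
    """
    h = np.asarray(height, dtype=np.float64)

    # First derivatives along rows (y) and columns (x).
    dy, dx = np.gradient(h)
    grad = np.hypot(dx, dy)

    # Second derivatives, obtained by differentiating the first derivatives.
    dyy, _ = np.gradient(dy)
    dxy, dxx = np.gradient(dx)

    # Assumption: the Laplacian serves as a simple curvature proxy.
    curvature = dxx + dyy

    # Assumption: planarity is 1 where all second derivatives vanish
    # (locally planar surface) and decays as the Hessian norm grows.
    hessian_norm = np.sqrt(dxx**2 + 2.0 * dxy**2 + dyy**2)
    planarity = 1.0 / (1.0 + hessian_norm)

    # Assumption: a pixel is an edge if its gradient magnitude exceeds
    # a fixed fraction of the largest gradient in the map.
    edges = grad > edge_fraction * grad.max()

    return grad, curvature, planarity, edges


if __name__ == "__main__":
    # Toy scene: a flat-roofed 10-unit-high block on flat ground.
    scene = np.zeros((16, 16))
    scene[4:12, 4:12] = 10.0
    grad, curvature, planarity, edges = geometric_features(scene)
    print("edge pixels:", int(edges.sum()))
    print("planarity at roof centre:", planarity[8, 8])
```

In this toy scene the roof interior and the open ground are flat, so their planarity stays at 1, while the gradient and edge mask respond only along the block outline. A network such as the one described would presumably compute cues like these on feature maps or predicted heights rather than on a hand-made array.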