Abstract
Deep networks are increasingly used for monocular depth estimation and semantic segmentation, and training them requires large amounts of annotated data. In the unmanned aerial vehicle (UAV) field, however, such data is scarce because of the specificity of the domain and the burden of the annotation process. Simulation engines make it possible to collect annotated data automatically with minimal effort. Training neural networks on synthetic data is therefore convenient, but it raises issues when shifting to the real domain. In this work, an extensive analytical study is conducted to assess the effect of several factors (model architecture, synthetic training data, and few-shot learning) on synthetic-to-real generalization in depth estimation and semantic segmentation of real UAV images. The Co-SemDepth (AlaaEldin and Odone in SAC 2026 conference) and TaskPrompter (Ye and Xu in The eleventh international conference on learning representations, 2022) models are compared in this study. To the best of our knowledge, this is the first synthetic-to-real study that uses a wide variety of datasets in the analysis, and the first to address synthetic-to-real depth estimation for UAVs. In addition, a new synthetic dataset, TopAir¹, is introduced to help fill the gap left by the scarcity of annotated datasets in the aerial field. The results reveal superior generalization for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. They also show that networks trained on the MidAir (Fonder and Van Droogenbroeck, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019) and TopAir datasets generalize better to real data containing rural scenes, while networks trained on SkyScenes (Khose et al. in European conference on computer vision, Springer, pp 19–35, 2024) and SynDrone (Rizzoli et al. in Proceedings of the IEEE/CVF international conference on computer vision, pp 2210–2220, 2023) generalize better to urban scenes. Few-shot learning generally improved the outcomes, and a visualization of the 3D semantic maps produced using the predictions is presented.
AlaaEldin et al. studied this question.