May 16, 2026 · Open Access

A study on synthetic-to-real generalization of depth estimation and semantic segmentation joint networks on aerial images

Key Points

This work aims to evaluate factors affecting the synthetic-to-real generalization in depth estimation and semantic segmentation networks using UAV images.
Utilized Co-SemDepth and TaskPrompter models for comparative analysis.
Introduced a new synthetic dataset called TopAir to address the lack of annotated aerial data.
Analyzed impacts of model architecture, synthetic training data, and few-shot learning on performance.
Co-SemDepth demonstrated superior performance in depth estimation compared to TaskPrompter.
TaskPrompter excelled in semantic segmentation, particularly with rural datasets.
Networks trained on MidAir and TopAir datasets showed better generalization to rural scenes, while those trained on SkyScenes and SynDrone excelled in urban scenes.

Abstract

The use of deep networks for monocular depth estimation and semantic segmentation is expanding rapidly. Training such networks requires an abundance of annotated data. However, in the unmanned aerial vehicle (UAV) field, the availability of such data is limited due to the specificity of the domain and the burden of the annotation process. Simulation engines allow annotated data to be collected automatically with minimal effort. Consequently, using synthetic data to train neural networks is convenient, but it raises issues when shifting to the real domain. In this work, an extensive analytical study is conducted to assess the effect of several factors (model architecture, synthetic training data, and few-shot learning) on synthetic-to-real generalization in depth estimation and semantic segmentation of real UAV images. The Co-SemDepth (AlaaEldin and Odone in SAC 2026 conference) and TaskPrompter (Ye and Xu in The eleventh international conference on learning representations, 2022) models are used for comparison in this study. To the best of our knowledge, this is the first synthetic-to-real study that adopts a wide variety of datasets in the analysis, and it is the first one addressing synthetic-to-real depth estimation in UAVs. In addition, a new synthetic dataset, TopAir, is introduced, helping to address the scarcity of annotated datasets in the aerial field. The results reveal superior generalization performance for Co-SemDepth in depth estimation and for TaskPrompter in semantic segmentation. The results also show better generalization to real data containing rural scenes for the networks trained on the MidAir (Fonder and Van Droogenbroeck, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, 2019) and TopAir datasets, while better generalization to urban scenes was achieved by the networks trained on SkyScenes (Khose et al. in European conference on computer vision, Springer, pp 19–35, 2024) and SynDrone (Rizzoli et al. in Proceedings of the IEEE/CVF international conference on computer vision, pp 2210–2220, 2023). Few-shot learning generally improved the outcomes, and a visualization of the 3D semantic maps built from the predictions is presented.
