What question did this study set out to answer?

The study evaluates data imputation methods for missing photometric bands, with the aim of improving the accuracy of photometric redshift estimates.

April 22, 2026

Comparative analysis of missing data imputation methods for CSST survey: Impact on photometric redshift estimation performance

Key Points

Evaluated data imputation methods as a way to improve photometric redshift accuracy when photometric bands are missing.
Benchmarked multiple machine learning and deep learning models, identifying k-nearest neighbors (KNN) and the attention-based SAITS model as the leading performers.
Applied these models to mock data from the China Space Station Survey Telescope (CSST).
Assessed performance under various missing-data conditions, including missing completely at random (MCAR) and non-random missingness.
KNN showed the highest accuracy under idealized MCAR conditions.
SAITS outperformed KNN in scenarios with incomplete training data or mixed missingness.
Domain consistency between training and testing data was crucial for optimal model performance.
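
As a rough illustration of the idealized case in the key points (MCAR missingness with a complete training set), the sketch below masks bands at random in a toy catalogue and fills them with a hand-written k-nearest-neighbors imputer. It is a minimal sketch under stated assumptions, not the study's pipeline: the band count, the toy magnitude model, the 20% missing fraction, and the helper names `make_galaxy`, `apply_mcar`, and `knn_impute` are all hypothetical.

```python
import math
import random

rng = random.Random(0)
N_BANDS = 7  # hypothetical band count for the toy catalogue


def make_galaxy():
    """Toy magnitudes: a base brightness plus a linear colour trend and noise."""
    base = rng.uniform(20.0, 25.0)
    slope = rng.uniform(-0.3, 0.3)
    return [base + slope * j + rng.gauss(0.0, 0.05) for j in range(N_BANDS)]


def apply_mcar(row, missing_frac):
    """Drop each band independently with probability missing_frac (MCAR)."""
    masked = [None if rng.random() < missing_frac else v for v in row]
    if all(v is None for v in masked):  # keep one band so matching is possible
        keep = rng.randrange(len(row))
        masked[keep] = row[keep]
    return masked


def knn_impute(row, reference, k=5):
    """Fill None entries with the mean of the k nearest complete reference rows.

    Distance is Euclidean, computed only over the bands observed in `row`.
    """
    observed = [j for j, v in enumerate(row) if v is not None]
    missing = [j for j, v in enumerate(row) if v is None]
    if not missing:
        return list(row)

    def dist(ref):
        return math.sqrt(sum((row[j] - ref[j]) ** 2 for j in observed))

    neighbours = sorted(reference, key=dist)[:k]
    filled = list(row)
    for j in missing:
        filled[j] = sum(ref[j] for ref in neighbours) / len(neighbours)
    return filled


reference = [make_galaxy() for _ in range(500)]  # complete training set
truth = [make_galaxy() for _ in range(100)]      # held-out true values
observed = [apply_mcar(row, 0.2) for row in truth]
imputed = [knn_impute(row, reference, k=5) for row in observed]

# Baseline for comparison: fill every gap with the reference mean of that band.
band_means = [sum(r[j] for r in reference) / len(reference) for j in range(N_BANDS)]


def rmse_on_missing(fill):
    """RMSE over the masked entries only, where fill(i, j) gives the filled value."""
    sq = [(fill(i, j) - truth[i][j]) ** 2
          for i, row in enumerate(observed)
          for j, v in enumerate(row) if v is None]
    return math.sqrt(sum(sq) / len(sq))


knn_rmse = rmse_on_missing(lambda i, j: imputed[i][j])
mean_rmse = rmse_on_missing(lambda i, j: band_means[j])
print(f"KNN RMSE: {knn_rmse:.3f} mag, band-mean RMSE: {mean_rmse:.3f} mag")
```

In this toy setting the neighbor-based fill recovers masked magnitudes more closely than a per-band mean because the observed bands carry information about the missing ones. The comparison says nothing about SAITS or about incomplete training sets, which the study treats separately.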

Abstract

Improving the accuracy of photometric redshifts (photo-z) is essential for reliable statistical studies of cosmology and galaxy evolution. However, missing photometric bands are a common observational challenge that can significantly degrade photo-z estimation accuracy. In this work, we present a systematic evaluation of data imputation methods aimed at improving photo-z performance. We benchmark a range of representative machine learning and deep learning architectures, identifying k-nearest neighbors (KNN) and the attention-based SAITS model as the leading performers. These models are then applied to China Space Station Survey Telescope (CSST) mock data to assess their performance under realistic observational conditions. Our results show that KNN yields the highest accuracy under idealized missing completely at random (MCAR) conditions with complete training sets, whereas robustness tests reveal that SAITS significantly outperforms KNN when training data are incomplete or when applied to realistic mixed-mechanism scenarios. We find that domain consistency between training and testing missingness patterns is a prerequisite for optimal performance, highlighting the risks of domain shift in supervised regression tasks. Furthermore, our analysis demonstrates that while general imputation models are highly effective for MCAR and missing at random data, they are detrimental when applied to missing not at random data arising from flux limits, as statistical models fail to capture the physical information inherent in these nondetections. Consequently, we advocate for more sophisticated architectures capable of disentangling stochastic missingness from physical nondetection to address these distinct mechanisms individually.
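
The contrast the abstract draws between random missingness and flux-limited nondetection can be illustrated with a toy single-band example. The sketch below is only an illustration under stated assumptions: the Gaussian magnitude distribution, the hard detection limit `MAG_LIMIT`, and the 30% MCAR fraction are hypothetical and are not taken from the study.

```python
import random
import statistics

rng = random.Random(1)

# Toy "true" magnitudes in one band (larger magnitude means fainter source).
true_mags = [rng.gauss(24.0, 1.0) for _ in range(5000)]
MAG_LIMIT = 24.5  # hypothetical detection limit

# MCAR: each measurement is lost with the same probability, whatever its value.
mcar_obs = [m for m in true_mags if rng.random() >= 0.3]

# MNAR from a flux limit: a value is missing exactly when the source is too faint.
mnar_obs = [m for m in true_mags if m <= MAG_LIMIT]
mnar_missing = [m for m in true_mags if m > MAG_LIMIT]

true_mean = statistics.fmean(true_mags)

# Under MCAR the observed sample is still representative of the full population.
mcar_bias = statistics.fmean(mcar_obs) - true_mean

# Under MNAR a fill value learned from detections is brighter than the limit,
# while every value it replaces is fainter than the limit.
mnar_fill = statistics.fmean(mnar_obs)
mnar_error = mnar_fill - statistics.fmean(mnar_missing)

print(f"MCAR bias of observed mean: {mcar_bias:+.3f} mag")
print(f"MNAR fill minus true mean of nondetections: {mnar_error:+.3f} mag")
```

Any fill value drawn from the detected population lies on the bright side of the limit, so it contradicts the one thing a nondetection does establish: that the source is fainter than the limit. This is one simple way to read the abstract's point that the nondetection itself carries physical information.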
