What question did this study set out to answer?

May 15, 2026 · Open Access

Random forest imputation and genomic prediction for missing egg production time-series data in yellow-feathered broiler breeders

Key Points

This study aims to evaluate imputation strategies for missing egg production time-series data and their impact on genomic prediction accuracy.
Compared six imputation strategies: Forward–Backward Mean, piecewise linear regression, spline regression, K-Nearest Neighbors, Random Forest, and LSTM.
Utilized data from 4,390 yellow-feathered broilers, with over 100,000 egg production records and 463,000 SNPs.
Assessed imputation accuracy with RMSE, MSE, and R² and evaluated downstream prediction by the accuracy of Genomic Estimated Breeding Values (GEBV).
Random Forest outperformed the other methods, reducing RMSE by 15.50% – 31.22% and MSE by 28.31% – 52.69%, and improving R² by 7.42% – 23.41%.
GEBV accuracy improved from 0.239 – 0.277 without imputation to 0.288 – 0.293 with imputation.
Random Forest improved GEBV accuracy by 5.78% to 21.76% (p < 0.05).
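
For readers unfamiliar with the three imputation metrics named above, the following is a minimal sketch of how they are commonly computed on values that were deliberately hidden and then imputed. The function name `imputation_metrics` and the example numbers are hypothetical, and R² is taken here as the coefficient of determination (1 − SS_res / SS_tot), which is an assumption; the study may define it differently, for example as a squared correlation.

```python
import math

def imputation_metrics(true_values, imputed_values):
    """Return MSE, RMSE, and R^2 for values that were masked and then imputed."""
    if len(true_values) != len(imputed_values) or not true_values:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_values)
    # Sum of squared errors between the held-out truth and the imputed values.
    sse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values))
    mse = sse / n
    rmse = math.sqrt(mse)
    # Coefficient of determination: 1 - SS_res / SS_tot (assumed definition).
    mean_true = sum(true_values) / n
    ss_tot = sum((t - mean_true) ** 2 for t in true_values)
    r2 = 1.0 - sse / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": rmse, "r2": r2}

# Hypothetical egg counts that were hidden, and the values an imputer produced.
held_out = [5.0, 6.0, 4.0, 7.0]
imputed = [5.0, 5.0, 5.0, 6.0]
metrics = imputation_metrics(held_out, imputed)
print(metrics)
```

Lower RMSE and MSE and higher R² indicate that the imputed values sit closer to the hidden truth, which is the sense in which the methods are ranked.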

Abstract

Egg production is a key economic trait in poultry breeding, and its longitudinal records are essential for accurate genetic evaluation. However, time-series egg production data often contain missing values due to sensor failure, recording interruption, or operational error. Such missingness not only reduces the accuracy of phenotypic reconstruction but may also affect downstream breeding-value estimation. This study aimed to compare multiple imputation strategies for missing egg production time-series data and to evaluate them from two complementary perspectives: imputation accuracy and downstream genomic and pedigree-based prediction performance. We benchmarked six imputation strategies (Forward–Backward Mean, piecewise linear regression, spline regression, K-Nearest Neighbors, Random Forest (RF), and Long Short-Term Memory (LSTM) networks) using datasets from 4,390 yellow-feathered broilers, encompassing more than 100,000 egg production records and 463,000 SNPs. In terms of imputation accuracy, RF consistently outperformed the alternative methods across simulated missingness rates of 5%, 10%, 15%, and 20%, reducing RMSE by 15.50% – 31.22% and MSE by 28.31% – 52.69%, while improving R² by 7.42% – 23.41% relative to the other methods. In downstream evaluation, the accuracy of Genomic Estimated Breeding Values (GEBV) without imputation ranged from 0.239 to 0.277, whereas imputation improved it to 0.288 – 0.293. RF emerged as the most robust approach, delivering statistically significant GEBV accuracy improvements of 5.78% to 21.76% (p < 0.05). This study demonstrates that RF imputation is a highly effective tool for handling missing data in egg production time series. By linking raw data processing to genomic prediction, these findings provide a practical computational framework for improving the reliability of breeding programs in the poultry industry and in other livestock species with longitudinal data.
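
The mask-and-score evaluation described in the abstract can be sketched as follows. This is not the study's pipeline. It assumes values are hidden completely at random (the abstract says only that missingness was simulated), and it swaps in a simple neighbour-averaging imputer as one plausible reading of the Forward–Backward Mean baseline rather than the Random Forest model. All function names and the example series are hypothetical.

```python
import math
import random

def mask_at_random(series, rate, seed=0):
    """Hide round(rate * n) entries; return the masked copy and hidden indices."""
    rng = random.Random(seed)
    n_hide = round(rate * len(series))
    hidden = sorted(rng.sample(range(len(series)), n_hide))
    masked = list(series)
    for i in hidden:
        masked[i] = None
    return masked, hidden

def forward_backward_mean(masked):
    """Fill each gap with the mean of the nearest observed value on each side."""
    filled = list(masked)
    for i, value in enumerate(masked):
        if value is not None:
            continue
        prev_obs = next((masked[j] for j in range(i - 1, -1, -1)
                         if masked[j] is not None), None)
        next_obs = next((masked[j] for j in range(i + 1, len(masked))
                         if masked[j] is not None), None)
        neighbours = [v for v in (prev_obs, next_obs) if v is not None]
        if not neighbours:
            raise ValueError("series has no observed values")
        filled[i] = sum(neighbours) / len(neighbours)
    return filled

def rmse_on_hidden(series, filled, hidden):
    """RMSE restricted to the positions that were deliberately hidden."""
    return math.sqrt(sum((series[i] - filled[i]) ** 2 for i in hidden) / len(hidden))

# Hypothetical weekly egg counts for one hen (illustrative values only).
weekly_eggs = [3, 5, 6, 6, 7, 6, 6, 5, 6, 5, 5, 4, 5, 4, 4, 3, 4, 3, 3, 2]
results = {}
for rate in (0.05, 0.10, 0.15, 0.20):
    masked, hidden = mask_at_random(weekly_eggs, rate, seed=42)
    filled = forward_backward_mean(masked)
    results[rate] = rmse_on_hidden(weekly_eggs, filled, hidden)
print(results)
```

Scoring only the hidden positions is the point of the design: observed values are left untouched, so any error comes from the imputer alone, and competing imputers can be compared on the same masks.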
