What question did this study set out to answer?

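The study asks whether survival risk prediction from high-dimensional protein biomarker data can be made more stable and interpretable by combining outcome-blind imputation of missing values with sequential feature selection.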

March 24, 2026 · Open Access

Enhancing survival risk prediction through imputation and feature selection in high-dimensional protein biomarker data

Key Points

Developed a reproducible analytical pipeline for survival risk prediction.
Used random forest-based imputation to address missing biomarker data.
Applied penalized Cox regression for feature selection.
Utilized random survival forests for modeling nonlinear effects and interactions.
Examined selected biomarkers with Cox proportional hazards models for clinical interpretability.
Identified stable prognostic biomarkers from high-dimensional data.
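Reduced the risk of information leakage by imputing without outcome information, while highlighting the challenge of overfitting in small-sample, high-dimensional survival settings.
Presented the workflow as a practical, transparent framework for biomarker-driven survival analysis rather than a new statistical methodology.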

Abstract

Protein-based molecular biomarkers play an important role in prognostic modeling and risk stratification in precision medicine. However, longitudinal survival studies involving high-dimensional biomarker data are frequently challenged by pervasive missingness and limited sample sizes, which can compromise model stability and interpretability. In this study, we present and evaluate a reproducible analytical pipeline for survival risk prediction that integrates established methods for missing data handling, feature selection, and time-to-event modeling. Missing values are addressed using an unsupervised random forest-based imputation approach that leverages internal covariate structure without incorporating outcome information, thereby reducing the risk of information leakage. Feature dimensionality is subsequently reduced using penalized Cox regression with the least absolute shrinkage and selection operator, followed by refinement and stability assessment using random survival forests to capture nonlinear effects and interactions. The final set of selected biomarkers is examined using univariate and multivariable Cox proportional hazards models to support clinical interpretability and risk stratification. Using a publicly available proteomic dataset from cancer patients, we demonstrate how this sequential modeling strategy can identify stable prognostic biomarkers while highlighting the challenges of overfitting in small-sample, high-dimensional survival settings. The proposed workflow serves as a practical and transparent framework for biomarker-driven survival analysis rather than a new statistical methodology.
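
The feature-selection step described above (penalized Cox regression with the least absolute shrinkage and selection operator) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. The simulated data, the function names (`cox_gradient`, `soft_threshold`, `lasso_cox`, `simulate`), the step size, and the penalty value are all assumptions chosen for illustration, and proximal gradient descent stands in for whatever solver the study actually used. Tied event times are assumed absent.

```python
import math
import random


def cox_gradient(X, time, event, beta):
    """Gradient of the average negative Cox partial log-likelihood (no ties assumed)."""
    n, p = len(X), len(beta)
    # Visit subjects from longest to shortest follow-up, so that everything
    # seen so far is exactly the risk set of the current subject.
    order = sorted(range(n), key=lambda i: -time[i])
    grad = [0.0] * p
    s0 = 0.0          # running sum of exp(eta) over the risk set
    s1 = [0.0] * p    # running sum of exp(eta) * x over the risk set
    for i in order:
        w = math.exp(sum(b * x for b, x in zip(beta, X[i])))
        s0 += w
        for j in range(p):
            s1[j] += w * X[i][j]
        if event[i]:  # only observed events contribute a likelihood term
            for j in range(p):
                grad[j] -= (X[i][j] - s1[j] / s0) / n
    return grad


def soft_threshold(z, t):
    """Proximal operator of the L1 penalty: shrink z toward zero by t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cox(X, time, event, lam, step=0.05, n_iter=500):
    """L1-penalized Cox regression fitted by proximal gradient descent."""
    beta = [0.0] * len(X[0])
    for _ in range(n_iter):
        grad = cox_gradient(X, time, event, beta)
        beta = [soft_threshold(b - step * g, step * lam)
                for b, g in zip(beta, grad)]
    return beta


def simulate(n=120, p=6, seed=0):
    """Hypothetical data: only feature 0 affects the hazard; the rest are noise."""
    rng = random.Random(seed)
    X, time, event = [], [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(p)]
        t = rng.expovariate(math.exp(1.2 * row[0]))  # latent event time
        c = rng.expovariate(0.3)                     # independent censoring time
        X.append(row)
        time.append(min(t, c))
        event.append(t <= c)
    return X, time, event


X, time, event = simulate()
grad_at_zero = cox_gradient(X, time, event, [0.0] * len(X[0]))
# Smallest penalty at which every coefficient is shrunk to exactly zero.
lam_max = max(abs(g) for g in grad_at_zero)
beta = lasso_cox(X, time, event, lam=0.3 * lam_max)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("selected features:", selected)
print("coefficients:", [round(b, 3) for b in beta])
```

Any penalty at or above `lam_max` removes every feature, so the penalty is the knob that controls how many biomarkers survive. A real analysis would typically choose it by cross-validation and would more likely rely on an established package than on a hand-written optimizer.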
