Protein-based molecular biomarkers play an important role in prognostic modeling and risk stratification in precision medicine. However, longitudinal survival studies involving high-dimensional biomarker data are frequently challenged by pervasive missingness and limited sample sizes, which can compromise model stability and interpretability. In this study, we present and evaluate a reproducible analytical pipeline for survival risk prediction that integrates established methods for missing data handling, feature selection, and time-to-event modeling. Missing values are addressed using an unsupervised random forest-based imputation approach that leverages internal covariate structure without incorporating outcome information, thereby reducing the risk of information leakage. Feature dimensionality is subsequently reduced using penalized Cox regression with the least absolute shrinkage and selection operator, followed by refinement and stability assessment using random survival forests to capture nonlinear effects and interactions. The final set of selected biomarkers is examined using univariate and multivariable Cox proportional hazards models to support clinical interpretability and risk stratification. Using a publicly available proteomic dataset from cancer patients, we demonstrate how this sequential modeling strategy can identify stable prognostic biomarkers while highlighting the challenges of overfitting in small-sample, high-dimensional survival settings. The proposed workflow serves as a practical and transparent framework for biomarker-driven survival analysis rather than a new statistical methodology.
Kumar et al. (Sun,) studied this question.
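To make the feature-selection step concrete, the following is a minimal sketch of the objective that a LASSO-penalized Cox model minimizes: the negative partial log-likelihood plus an L1 penalty on the coefficients. It is an illustration under stated assumptions, not the implementation used in this study. The function name `penalized_cox_loss`, the toy data, the Breslow-style risk sets, and the scaling of the likelihood by the sample size are all choices made here for illustration.

```python
import math


def penalized_cox_loss(times, events, X, beta, lam):
    """Negative Cox partial log-likelihood (Breslow-style risk sets) plus an L1 penalty.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    X      : list of covariate rows, one per subject
    beta   : coefficient vector
    lam    : L1 penalty weight (hypothetical tuning parameter)
    """
    # Linear predictor eta_i = x_i . beta for every subject
    eta = [sum(b * x for b, x in zip(beta, row)) for row in X]

    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if not d_i:
            continue  # censored subjects contribute only through risk sets
        # Risk set: all subjects still under observation at time t_i
        risk_sum = sum(math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i)
        loglik += eta[i] - math.log(risk_sum)

    # Assumption: likelihood is averaged over subjects before adding the penalty
    l1_penalty = lam * sum(abs(b) for b in beta)
    return -loglik / len(times) + l1_penalty


# Toy data (made up for illustration): 4 subjects, 2 biomarkers
times = [5.0, 3.0, 8.0, 2.0]
events = [1, 1, 0, 1]
X = [[0.2, 1.0], [1.5, -0.3], [-0.7, 0.4], [0.9, 0.8]]

loss_at_zero = penalized_cox_loss(times, events, X, [0.0, 0.0], lam=0.1)
```

Larger values of `lam` push more coefficients to exactly zero during optimization, which is what reduces the biomarker panel before the random survival forest refinement. In small-sample settings, the selected set can vary considerably with `lam` and with the resampling split, which is why a subsequent stability assessment is useful.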