Predicting leukemia subtype and risk remains a major challenge because multi-omics data are high-dimensional and heterogeneous. To address this, this paper proposes a new machine learning framework, the CCA-RFE selector (CCARS), which integrates multi-omics data and performs feature selection. The proposed scheme uses canonical correlation analysis (CCA) to identify features that are correlated across omics layers, followed by recursive feature elimination (RFE) to progressively narrow the set down to the most informative features. The framework is evaluated on publicly available leukemia multi-omics data from TCGA-LAML and GEO (GSE37642). Performance is assessed with nested cross-validation using AUC-ROC, PR-AUC, accuracy, and F1-score. The experimental results indicate that CCARS tends to outperform baseline methods such as PCA, lasso regression, and CCA alone. In particular, CCARS achieved 90% classification accuracy and an F1-score of 0.85, outperforming the existing models while keeping computation time reasonable. The framework was also validated on an independent gene expression dataset to assess partial generalization, and the findings suggest that it can support AML risk classification and biomarker discovery.
Saini et al. (Sat,) studied this question.