What question did this study set out to answer?

This research aims to evaluate and compare five machine learning classifiers for classifying leukemia using gene expression data.

June 2, 2026 | Open Access

Extended Gene-Expression-Based Leukemia Classification: A Comparative Evaluation of Five Machine Learning Classifiers With SNR Feature Selection and Cross-Validation

Key Points

The study analyzes the Golub 1999 leukemia dataset (72 samples, 7,129 genes) for ALL vs. AML classification.
Five classifiers are compared: K-Nearest Neighbor (KNN), Support Vector Machine (SVM), Logistic Regression, Decision Tree, and Naive Bayes, all using Signal-to-Noise Ratio (SNR) feature selection.
Performance is evaluated with 10-fold stratified cross-validation.
On the 15-sample held-out test set, KNN, SVM, Logistic Regression, and Naive Bayes each achieved 100% accuracy; Decision Tree achieved 86.67% accuracy (F1: 83.33%).
Under 10-fold cross-validation, SVM, Logistic Regression, and Naive Bayes each reached a mean accuracy of 95.71% ± 9.15%.
PCA visualization shows clear class separation, with the first two principal components explaining 62.5% of the variance.
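
As a rough illustration of the SNR feature selection named in the key points, the sketch below scores each gene with the Golub-style signal-to-noise ratio, (mean of class A minus mean of class B) divided by (standard deviation of class A plus standard deviation of class B), and keeps the 30 highest-scoring genes. Several things here are assumptions rather than details taken from the paper: ranking by absolute score, the function names `snr_scores` and `top_k_genes`, and the synthetic data, which only mimics the 72 x 7,129 shape and the 47 ALL / 25 AML class counts usually reported for this dataset.

```python
import numpy as np

def snr_scores(X, y):
    """Signal-to-noise score per gene for a two-class problem.

    X: (n_samples, n_genes) expression matrix; y: array of 0/1 labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    a, b = X[y == 0], X[y == 1]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    spread = a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1)
    # Small epsilon guards against genes that are constant in both classes.
    return mean_diff / (spread + 1e-12)

def top_k_genes(X, y, k=30):
    """Indices of the k genes with the largest absolute SNR (an assumption)."""
    scores = snr_scores(X, y)
    return np.argsort(-np.abs(scores))[:k]

# Synthetic stand-in for the expression matrix (NOT the real Golub data).
rng = np.random.default_rng(0)
y = np.array([0] * 47 + [1] * 25)      # 0 = ALL, 1 = AML
X = rng.normal(size=(72, 7129))
X[y == 1, :5] += 3.0                   # make the first five genes informative

selected = top_k_genes(X, y, k=30)
print(sorted(selected.tolist())[:5])
```

Ranking by absolute value keeps genes that are over-expressed in either class. Some implementations instead take an equal number of genes from each direction, which would be a small change to `top_k_genes`.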

Abstract

This paper extends the work of Bouazza et al. (IEEE, 2015), which applied K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers with filter-based feature selection to classify gene expression microarray data for leukemia, prostate, and colon cancer. Our extension focuses exclusively on the Golub 1999 leukemia dataset (72 samples, 7,129 genes; ALL vs. AML classification) and makes four primary contributions over the base paper: (1) three additional classifiers, Logistic Regression (LR), Decision Tree (DT), and Naive Bayes (NB), completing the five-classifier ML evaluation pipeline; (2) tuning of the KNN neighbor count over K = 1–20, which the original did not perform; (3) 10-fold stratified cross-validation in place of the single train-test split; and (4) a broader set of evaluation metrics: precision, recall, F1-score, and the confusion matrix. Signal-to-Noise Ratio (SNR) feature selection retaining the top 30 genes is preserved from the original paper for reproducibility. On the 15-sample held-out test set, KNN (K = 1), SVM, Logistic Regression, and Naive Bayes each achieved 100% accuracy; Decision Tree achieved 86.67% (F1: 83.33%). Under 10-fold cross-validation, SVM, Logistic Regression, and Naive Bayes each achieved a mean accuracy of 95.71% ± 9.15%, indicating that performance holds up beyond a single train-test split. A PCA projection shows clear class separation, with PC1 and PC2 together explaining 62.5% of the variance. The results indicate that the original SNR-based methodology is reproducible and that several classifiers achieve equivalent discriminative performance on this dataset.
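
The evaluation protocol described in the abstract (top-30 SNR gene selection, five classifiers, 10-fold stratified cross-validation) could be sketched with scikit-learn roughly as follows. This is a minimal sketch under stated assumptions, not the authors' code: the data are synthetic, the hyperparameters (linear SVM kernel, `max_iter`, random seeds) are placeholders, feature scaling is added for convenience, and the helper `abs_snr` is hypothetical. Placing gene selection inside the pipeline, so that it is refit on each training fold, is also a design choice made here; the abstract does not say whether the paper did so.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def abs_snr(X, y):
    """Absolute signal-to-noise score per feature for 0/1 labels (hypothetical helper)."""
    a, b = X[y == 0], X[y == 1]
    spread = a.std(axis=0, ddof=1) + b.std(axis=0, ddof=1) + 1e-12
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / spread

# Synthetic placeholder data (NOT the Golub dataset): 72 samples, 500 features.
rng = np.random.default_rng(0)
y = np.array([0] * 47 + [1] * 25)
X = rng.normal(size=(72, 500))
X[y == 1, :10] += 2.0                  # ten informative features

models = {
    "KNN": KNeighborsClassifier(n_neighbors=1),
    "SVM": SVC(kernel="linear"),
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=0),
    "NB": GaussianNB(),
}
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    # Selection and scaling are refit on each training fold to avoid leakage.
    pipe = make_pipeline(SelectKBest(abs_snr, k=30), StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

With only 72 samples, each fold holds seven or eight test samples, so a single misclassification moves that fold's accuracy by more than ten percentage points. That granularity helps explain why a fold-to-fold spread as wide as the reported ± 9.15% can accompany a high mean accuracy.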
