This paper extends the work of Bouazza et al. (IEEE, 2015), which applied K-Nearest Neighbor (KNN) and Support Vector Machine (SVM) classifiers with filter-based feature selection to classify gene expression microarray data for leukemia, prostate, and colon cancer. Our extension focuses exclusively on the Golub 1999 leukemia dataset (72 samples, 7,129 genes; ALL vs. AML classification) and makes four primary contributions over the base paper: (1) three additional classifiers, namely Logistic Regression (LR), Decision Tree (DT), and Naive Bayes (NB), which bring the evaluation to five classifiers; (2) tuning of the KNN neighborhood size K over K = 1–20, which the original did not perform; (3) 10-fold stratified cross-validation supplementing the single train-test split; and (4) a broader set of evaluation metrics: precision, recall, F1-score, and the confusion matrix. Signal-to-Noise Ratio (SNR) feature selection retaining the top 30 genes is preserved from the original paper for reproducibility. On the 15-sample held-out test set, KNN (K=1), SVM, Logistic Regression, and Naive Bayes each achieved 100% accuracy; Decision Tree achieved 86.67% (F1: 83.33%). Under 10-fold cross-validation, SVM, Logistic Regression, and Naive Bayes each achieved a mean accuracy of 95.71% ± 9.15%, supporting the generalizability of these results. A PCA projection shows clear class separation, with PC1 and PC2 together explaining 62.5% of the variance. Results demonstrate that the original SNR-based methodology is reproducible and that multiple classifiers achieve equivalent discriminative performance on this dataset.
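To make the SNR feature-selection step concrete, the following is a minimal sketch in Python, not the authors' implementation. It assumes the signal-to-noise score in the form introduced by Golub et al., (mean of class 0 minus mean of class 1) divided by (standard deviation of class 0 plus standard deviation of class 1), and it assumes genes are ranked by the absolute value of that score. The function names `snr_scores` and `top_genes`, the 0/1 label encoding, and the toy data are hypothetical and chosen only for illustration.

```python
import math
from statistics import mean, stdev


def snr_scores(X, y):
    """Per-gene signal-to-noise ratio: (mean_0 - mean_1) / (sd_0 + sd_1).

    X is a list of samples, each a list of expression values.
    y is a list of 0/1 class labels (assumed encoding, e.g. 0 = ALL, 1 = AML).
    Each class needs at least two samples so the sample standard
    deviation is defined.
    """
    class0 = [row for row, label in zip(X, y) if label == 0]
    class1 = [row for row, label in zip(X, y) if label == 1]
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row in class0]
        b = [row[g] for row in class1]
        diff = mean(a) - mean(b)
        denom = stdev(a) + stdev(b)
        if denom > 0:
            scores.append(diff / denom)
        elif diff == 0:
            # Gene is constant and identical in both classes: no signal.
            scores.append(0.0)
        else:
            # Zero within-class spread but different means: treated as
            # infinitely separating (a choice made for this sketch).
            scores.append(math.copysign(math.inf, diff))
    return scores


def top_genes(X, y, n=30):
    """Indices of the n genes with the largest absolute SNR score."""
    scores = snr_scores(X, y)
    ranked = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:n]


# Toy data (hypothetical): 6 samples, 3 genes; only gene 0 separates the classes.
X = [
    [1.0, 5.0, 3.0],
    [1.2, 5.1, 9.0],
    [0.8, 4.9, 6.0],
    [3.0, 5.0, 4.0],
    [3.2, 5.2, 8.0],
    [2.8, 4.8, 5.0],
]
y = [0, 0, 0, 1, 1, 1]

print(top_genes(X, y, n=1))
```

On the real dataset the same call would be made with `n=30`. Computing the scores on the training samples only, rather than on all 72 samples, avoids leaking test information into the gene ranking; whether the paper does this is not stated in the abstract.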
Abhishek Raj Urs K S studied this question.