This project compares several featurization strategies and machine-learning classification models for predicting whether peptides are pro-inflammatory. Because the training dataset was highly imbalanced, up-sampling was applied to increase the representation of the minority (positive) class. Three feature representations were explored: amino-acid composition, amino-acid physicochemical characteristics, and k-mer (bag-of-words) representations. For each featurization, multiple classification models were trained and evaluated, including logistic regression, support vector machines, decision trees, random forest, gradient boosting, AdaBoost, and a neural network. Model performance was compared using confusion matrices and ROC curves. Across all approaches, random forest consistently produced the highest AUC values, indicating stronger predictive power than the other models. This advantage is attributed to a combination of the method's sophistication and its performance in over-training regimes.
Olivia Plaku studied this question.
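The steps named in the summary above (composition and k-mer featurization, up-sampling of the minority class, and AUC as the comparison metric) can be illustrated with a minimal standard-library sketch. This is not the project's actual code: the function names `aa_composition`, `kmer_counts`, `upsample_minority`, and `roc_auc` are hypothetical, the choice of k = 2 is an assumption, and simple random duplication stands in for whatever up-sampling method the project used.

```python
import random
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(peptide):
    """Fraction of each standard amino acid in the peptide (20 values)."""
    counts = Counter(peptide)
    n = len(peptide)
    return [counts[aa] / n for aa in AMINO_ACIDS]

def kmer_counts(peptide, k=2):
    """Bag-of-words counts over overlapping k-mers (k=2 is an assumption)."""
    return Counter(peptide[i:i + k] for i in range(len(peptide) - k + 1))

def upsample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until both classes match.

    Assumes binary labels (0/1) and list inputs; random duplication is an
    illustrative stand-in, not necessarily the project's method.
    """
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    minority, minority_label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    deficit = abs(len(pos) - len(neg))
    extra = [rng.choice(minority) for _ in range(deficit)]
    return samples + extra, labels + [minority_label] * len(extra)

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For example, `kmer_counts("ACAC")` counts `"AC"` twice and `"CA"` once, and `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75 because three of the four positive-negative pairs are ranked correctly. The pairwise AUC is quadratic in the number of samples, which is fine for illustration but would normally be replaced by a library routine on real data.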