Chronic inflammation plays a harmful role in diseases such as asthma, eczema, arthritis, and cardiovascular disease. Traditional treatments, such as over-the-counter non-steroidal anti-inflammatory drugs (NSAIDs), carry long-term risks, which creates a need for new therapeutic agents. One such class is anti-inflammatory peptides, short sequences of amino acids that help inhibit inflammatory pathways. Anti-inflammatory peptides are rare, and discovering them can take hours of expensive lab synthesis and biological validation, but machine learning can make this process shorter and cheaper. This study used a dataset of 4194 binary-labeled peptide sequences, provided as a CSV file by George Mason University’s Young Scholars Research NextGen Science: Machine Learning & Bioinformatics program. The data were imported into Orange Data Mining software for n-gram featurization, model training, and testing with 10-fold cross-validation. AUC was used as the primary evaluation metric because it is insensitive to the class imbalance present in this sparse dataset. Logistic regression had the most robust classification performance and supported the hypothesis with an AUC of 0.813, well above 0.7. Balanced random forest also supported the hypothesis with a comparable but lower AUC of ≈0.757. SVM had the lowest AUC (≈0.529), likely due to noise in the high-dimensional, sparse feature data, which linear models such as logistic regression are better at classifying. This study shows that machine learning models trained and tested on sequence-derived n-gram features alone can make discriminative predictions, helping to discover therapeutic peptides faster and at lower cost.
Shahan Sheru (Fri,) studied this question.
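
The study itself was carried out in Orange Data Mining, so the following Python is only a minimal sketch of two ideas the abstract relies on, not the study's actual workflow: counting n-grams in a peptide sequence, and computing AUC as the fraction of positive-negative pairs that a classifier ranks correctly. The function names (ngram_counts, auc), the example sequence, and the labels and scores are all hypothetical and chosen purely for illustration.

```python
from collections import Counter


def ngram_counts(sequence, n=2):
    """Count overlapping n-grams (runs of n residues) in a peptide sequence."""
    return Counter(sequence[i:i + n] for i in range(len(sequence) - n + 1))


def auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. The value is an average over
    positive-negative pairs rather than over individual examples.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up sequence, not taken from the study's dataset.
features = ngram_counts("AKLAKL", n=2)
print(features)

# Hypothetical labels (1 = anti-inflammatory) and classifier scores.
labels = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.4, 0.35, 0.8, 0.6, 0.1]
print(auc(labels, scores))
```

In this toy example only two of the six labels are positive, yet the AUC is determined entirely by how the positives rank against the negatives. This pairwise definition is the sense in which AUC is less affected by class imbalance than plain accuracy.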