Neuropeptides play many roles. While mainly present in neurons, they act as neurotransmitters in the nervous system and as hormones in the endocrine system. Recent peptide research has been ground-breaking, but neither cost- nor time-efficient. In this study, machine learning (ML) techniques were applied to explore whether peptide sequence information is sufficient to distinguish neuropeptides from non-neuropeptides. A dataset of 4,870 peptide sequences, divided equally between the positive and negative classes, was analyzed using amino acid composition, peptide length, and k-mer bag-of-words features. The workflow was built on the Orange3 data mining platform, which allowed time-efficient evaluation. The models applied in this study were logistic regression, decision trees, support vector machines, random forests, and gradient boosting; all were trained on the data to assess how accurately they identify neuropeptides. Model performance was measured using accuracy, precision, recall, F1 score, and area under the curve (AUC). The results showed that models using a combination of features achieved strong predictive performance, with the best accuracy obtained from gradient boosting and random forests. These findings indicate that amino acid sequence information combined with the k-mer bag-of-words approach is sufficient for gradient boosting and random forest models to accurately distinguish neuropeptides from non-neuropeptides. This study demonstrates the value of ML for peptide research and showcases its potential to accelerate neuropeptide discovery.
Jasmine Arora (Sat,) studied this question.
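The three feature families named in the abstract (amino acid composition, peptide length, and k-mer bag-of-words) can be illustrated with a minimal Python sketch. This is not the study's Orange3 workflow. The function names, the feature-name prefixes, and the choice of k = 2 are assumptions made for illustration, since the abstract does not state the k-mer size. The example input is Met-enkephalin (YGGFM), a well-known neuropeptide.

```python
from collections import Counter

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in the sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {f"aac_{aa}": (counts[aa] / n if n else 0.0) for aa in AMINO_ACIDS}


def kmer_bag_of_words(seq, k=2):
    """Count overlapping k-mers, treating each k-mer as a 'word'.

    k=2 is an assumed default; the abstract does not give the value used.
    """
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def featurize(seq, k=2):
    """Combine composition, length, and k-mer counts into one feature dict."""
    seq = seq.strip().upper()
    features = amino_acid_composition(seq)
    features["length"] = len(seq)
    for kmer, count in kmer_bag_of_words(seq, k).items():
        features[f"kmer_{kmer}"] = count
    return features


# Met-enkephalin, used here only as an illustrative input.
example = featurize("YGGFM")
```

Feature dictionaries of this kind, one per peptide, could then be assembled into a table and passed to any of the classifiers listed in the abstract. k-mers absent from a given sequence would be filled in as zero when the table is built.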