What question did this study set out to answer?

The study aims to evaluate the effectiveness of various machine learning models in identifying neuropeptides from peptide sequences.

February 14, 2026 | Open Access

A Comparative Analysis of Machine Learning Models for Neuropeptide Identification

Key Points

The study aims to evaluate the effectiveness of various machine learning models in identifying neuropeptides from peptide sequences.
Used a dataset of 4,870 peptide sequences divided equally into positive and negative classes.
Used features such as amino acid composition, peptide length, and k-mer bag-of-words in the analysis.
Built the workflow on the Orange3 data mining platform for time-efficient evaluation.
Trained models including logistic regression, decision trees, support vector machines, random forests, and gradient boosting.
Evaluated model performance using metrics like accuracy, precision, recall, F1 score, and area under the curve (AUC).
Gradient boosting and random forests showed the highest accuracy in identifying neuropeptides.
Models using a combination of features delivered strong predictive performance.
Amino acid sequence information, paired with a k-mer bag-of-words representation, was sufficient for the models to distinguish neuropeptides from non-neuropeptides.
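
The three feature types named in the key points (amino acid composition, peptide length, and k-mer bag-of-words) can be illustrated with a minimal sketch. This is not the study's Orange3 workflow: the function names, the choice of k = 2, and the toy peptide are assumptions made for illustration, and the code assumes a non-empty sequence over the 20 standard amino acids.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(seq):
    """Fraction of each standard amino acid in a non-empty sequence."""
    counts = Counter(seq)
    n = len(seq)
    return {aa: counts[aa] / n for aa in AMINO_ACIDS}


def kmer_bag_of_words(seq, k=2):
    """Count every overlapping k-mer (substring of length k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def featurize(seq, k=2):
    """Combine composition, length, and k-mer counts into one feature dict."""
    features = {f"aac_{aa}": v for aa, v in amino_acid_composition(seq).items()}
    features["length"] = len(seq)
    for kmer, count in kmer_bag_of_words(seq, k).items():
        features[f"kmer_{kmer}"] = count
    return features


# Hypothetical toy peptide, not taken from the study's dataset.
example = featurize("GAGAK")
```

Each peptide becomes a flat dictionary of numeric features, which could then be turned into a table with one row per peptide and passed to any of the classifiers listed above.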

Abstract

Neuropeptides play many roles. While mainly present in glial cells, they act as neurotransmitter peptides in the endocrine system and as hormonal peptides in the immune system. Recent peptide research has been ground-breaking, but neither cost- nor time-efficient. In this study, machine learning techniques were applied to explore whether peptide sequence information is sufficient to distinguish neuropeptides from non-neuropeptides. A dataset of 4,870 peptide sequences, divided equally between the positive and negative classes, was analyzed using amino acid composition, peptide length, and k-mer bag-of-words features. The workflow was built on the Orange3 data mining platform, which allowed time-efficient evaluation. The models applied in this study were logistic regression, decision trees, support vector machines, random forests, and gradient boosting; each was trained on the data to assess how accurately it identifies neuropeptides. Model performance was evaluated using accuracy, precision, recall, F1 score, and area under the curve (AUC). The results showed that models using a combination of features achieved strong predictive performance, with the highest accuracy obtained by gradient boosting and random forests. These findings indicate that amino acid sequence information, combined with the k-mer bag-of-words approach, is sufficient for the gradient boosting and random forest machine learning (ML) models to accurately distinguish neuropeptides from non-neuropeptides. This study demonstrates the value of ML for peptide research and its potential to accelerate neuropeptide discovery.
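
For readers unfamiliar with the evaluation metrics listed in the abstract, the following is a minimal sketch of how they are defined for a binary task (1 = neuropeptide, 0 = non-neuropeptide). It is not the study's Orange3 evaluation: the function names and the toy labels and scores are assumptions for illustration only, and AUC is computed here through its pairwise-ranking interpretation.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 from binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Assumes both classes are present.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy values, not results from the study.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
metrics = classification_metrics(y_true, y_pred)
area = auc(y_true, scores)
```

Accuracy, precision, recall, and F1 depend on a fixed decision threshold, whereas AUC summarizes how well the model's scores rank positives above negatives across all thresholds, which is why studies such as this one typically report both kinds of metric.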
