What question did this study set out to answer?

The aim is to develop and evaluate machine learning models to predict the function of anticancer peptides using sequence-derived features.

February 13, 2026 · Open Access

Predicting Anticancer Peptides through Physicochemical and Sequence-Based Machine Learning

Key Points

The study developed several machine learning pipelines to predict anticancer peptide function.
A curated dataset of 6,785 labeled peptide sequences was analyzed.
Classification models included logistic regression, support vector machines, decision trees, random forests, gradient boosting, and neural networks.
Model performance was assessed using stratified five-fold cross-validation with accuracy, AUC, precision, recall, F1-score, and Matthews correlation coefficient (MCC).
Physicochemical property-based models achieved approximately 89% accuracy and an AUC of 0.96.
The random forest classifier trained on physicochemical properties was the strongest model.
Sequence-based k-mer models and amino acid composition approaches reached just under 80% accuracy and did not outperform the physicochemical models.
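
The threshold-based metrics used in the evaluation can all be derived from the four confusion-matrix counts. The following is a minimal sketch in plain Python, not the study's Orange Data Mining workflow; the function name `binary_metrics` and the example labels are hypothetical and are not results from the paper. AUC is left out because it needs ranked prediction scores rather than hard class labels.

```python
import math


def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, F1, and MCC for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # MCC uses all four counts, so it stays informative when classes are imbalanced.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical labels for illustration only (1 = anticancer, 0 = not).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
```

In a cross-validated setting, one common choice is to compute these values per fold and then average them across the five folds.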

Abstract

Anticancer peptides (ACPs) are short chains of amino acids that can destroy cancer cells while reducing harm to healthy tissue. However, identifying ACPs is costly and time-consuming, which has motivated the use of computational approaches. In this study, several machine learning pipelines were developed and evaluated to predict anticancer peptide function using sequence-derived features. A curated dataset of 6,785 labeled peptide sequences was analyzed using Orange Data Mining, with models based on three feature types: amino acid composition, aggregated physicochemical properties, and k-mer sequence bag-of-words encoding. The classification models implemented were logistic regression, support vector machines, decision trees, random forests, gradient boosting, and neural networks. Model performance was assessed using stratified five-fold cross-validation and measured by accuracy, area under the receiver operating characteristic curve (AUC), precision, recall, F1-score, and Matthews correlation coefficient (MCC). Among the three main pipelines, physicochemical property-based models, particularly the random forest classifier, achieved the strongest performance, reaching approximately 89% accuracy and an AUC of 0.96. Sequence-based k-mer models and amino acid composition approaches also showed reasonable predictive capability, with accuracy just under 80%, but did not outperform physicochemical representations. These findings indicate that biologically informed feature representations substantially improve anticancer peptide prediction and demonstrate the usefulness of machine learning pipelines for accelerating peptide-based drug discovery.
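
Two of the sequence-derived feature types named in the abstract, amino acid composition and k-mer bag-of-words encoding, can be illustrated with a short sketch. This is plain Python written for illustration, not the Orange Data Mining workflow the study used. The function names, the example peptide, and the choice of k = 2 are all assumptions; the abstract does not state which k was used.

```python
from collections import Counter

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence."""
    sequence = sequence.upper()
    counts = Counter(sequence)
    length = len(sequence)
    if length == 0:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    return {aa: counts.get(aa, 0) / length for aa in AMINO_ACIDS}


def kmer_bag_of_words(sequence, k=2):
    """Count overlapping k-mers; their positions in the sequence are discarded."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical example peptide, not taken from the study's dataset.
peptide = "KLAKLAK"
composition = amino_acid_composition(peptide)  # 20 fixed-length features
kmers = kmer_bag_of_words(peptide, k=2)        # sparse counts keyed by k-mer
```

The composition vector has a fixed length of 20 regardless of peptide length, while the k-mer representation grows with the number of distinct k-mers observed, which is one reason the two encodings can behave differently in downstream classifiers.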
