What question did this study set out to answer?

To develop a machine learning framework for the rapid prediction of cell-penetrating peptides based on physicochemical properties.

February 14, 2026 | Open Access

Predicting Cell Penetrating Peptides Using Aggregating Physiochemical Properties and Classification Algorithms

Key Points

Aimed to develop a machine learning framework for the rapid prediction of cell-penetrating peptides (CPPs) based on physicochemical properties.
Utilized five supervised learning algorithms: logistic regression, support vector machine, decision tree, random forest, and gradient boosting.
Analysed a dataset of 5,102 peptides with a 2.5:1 ratio of non-CPPs to CPPs using stratified 10-fold cross-validation.
Represented each peptide as a vector by aggregating 13 physicochemical properties from amino acid sequences.
Applied minor oversampling to address class imbalance during model training.
Random forest model achieved the best performance across evaluation metrics.
Gradient boosting closely followed in performance.
Logistic regression and decision tree classifiers demonstrated moderate performance.
Support vector machine classifier performed poorly, resembling random classification.
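
The aggregation step described in the key points can be sketched roughly as follows. This is a minimal illustration, not the study's implementation: the summary does not list the 13 properties, so two stand-ins are used here (the Kyte-Doolittle hydropathy scale and a simplified side-chain charge that treats histidine as neutral), and the example sequences and the names `aggregate` and `standardize` are hypothetical.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy index per residue (a published scale).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
# Assumption: simplified side-chain charge near neutral pH, His taken as 0.
CHARGE = {aa: 0.0 for aa in HYDROPATHY}
CHARGE.update({"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0})

# Stand-ins for the study's 13 property tables (not listed in the summary).
PROPERTY_TABLES = [HYDROPATHY, CHARGE]


def aggregate(sequence):
    """Sum each property over the residues of one peptide."""
    return [sum(table[aa] for aa in sequence.upper()) for table in PROPERTY_TABLES]


def standardize(rows):
    """Z-score each column: subtract the column mean, divide by its std."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Hypothetical example sequences, chosen only to exercise the functions.
peptides = ["RRRRRRRR", "GRKKRRQRRRPPQ", "AAAAGGGG", "DEDEDEDE"]
raw = [aggregate(p) for p in peptides]
features = standardize(raw)
```

Because the values are summed rather than averaged, longer peptides produce larger raw magnitudes, which is one reason a standardization step is useful. In a cross-validated setting the mean and standard deviation would normally be estimated on the training folds only and then applied to the validation fold.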

Abstract

Cell-penetrating peptides (CPPs) are short peptides capable of traversing cell membranes without damaging them, making them valuable for drug delivery and similar applications. However, experimental identification of CPPs is time-consuming and expensive, which makes computational screening an attractive alternative. This study presents a machine learning framework for predicting CPPs using aggregated physicochemical properties derived from amino acid sequences. We evaluated five supervised learning algorithms, namely logistic regression, support vector machine (SVM), decision tree, random forest, and gradient boosting, on a dataset of 5,102 peptides with a 2.5:1 ratio of non-CPPs to CPPs, using stratified 10-fold cross-validation. Each peptide was represented as a 13-dimensional vector by summing physicochemical property values across its component amino acids and then standardizing the result. To address the significant class imbalance, minor oversampling was applied only within the training (non-validation) folds. Random forest achieved the best performance across all metrics, with gradient boosting trailing closely behind. Logistic regression and decision tree classifiers showed moderate performance, but the SVM exhibited near-random classification, in contrast to previous CPP prediction studies. The framework achieves performance superior, or at least comparable, to position-specific methods while using significantly fewer features, supporting the hypothesis and demonstrating its usefulness for rapid CPP screening in drug delivery research.
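
The evaluation protocol in the abstract (stratified 10-fold cross-validation with oversampling confined to the training folds) might look like the sketch below. The summary does not state which oversampling method or target ratio was used, so random duplication of minority samples up to a configurable `ratio` is an assumption here, as are the function names and the toy labels.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, k, rng):
    """Assign sample indices to k folds, keeping class ratios similar."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)  # deal indices round-robin
    return folds


def oversample_minority(train_idx, labels, rng, ratio=1.0):
    """Duplicate random minority samples up to ratio * majority count.

    Assumption: the study's exact oversampling scheme is not given.
    """
    counts = Counter(labels[i] for i in train_idx)
    minority = min(counts, key=counts.get)
    target = int(max(counts.values()) * ratio)
    pool = [i for i in train_idx if labels[i] == minority]
    extra = [rng.choice(pool) for _ in range(max(0, target - len(pool)))]
    return train_idx + extra


rng = random.Random(0)
labels = [1] * 20 + [0] * 50  # toy labels with a 2.5:1 non-CPP:CPP ratio
folds = stratified_folds(labels, k=10, rng=rng)

splits = []
for fold_no, val_idx in enumerate(folds):
    # Training set = every fold except the current validation fold.
    train_idx = [i for j, fold in enumerate(folds) if j != fold_no for i in fold]
    # Oversample after the split so no duplicate leaks into validation.
    train_idx = oversample_minority(train_idx, labels, rng)
    splits.append((train_idx, val_idx))
```

Oversampling after the split matters: if duplicates were created before the folds were formed, copies of the same peptide could land in both training and validation data and inflate the scores. Each validation fold keeps the original class ratio, so the metrics reflect the imbalance present in the dataset.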
