Cell-penetrating peptides (CPPs) are short peptides capable of traversing cell membranes without damaging them, making them valuable for drug delivery and related applications. However, experimental identification of CPPs is time-consuming and expensive, which makes computational screening an attractive alternative. This study presents a machine learning framework for predicting CPPs from aggregated physicochemical properties of their amino acid sequences. We evaluated five supervised learning algorithms, namely logistic regression, support vector machine (SVM), decision tree, random forest, and gradient boosting, on a dataset of 5,102 peptides with a 2.5:1 ratio of non-CPPs to CPPs, using stratified 10-fold cross-validation. Each peptide was represented as a 13-dimensional vector obtained by summing physicochemical property values across its component amino acids and then standardizing the result. To address the significant class imbalance, minor oversampling was applied to the training folds of each split, never to the validation fold. Random forest achieved the best performance across all metrics, with gradient boosting trailing closely behind. Logistic regression and decision tree classifiers showed moderate performance, whereas SVM exhibited near-random classification, in contrast to previous CPP prediction studies. The framework achieves performance comparable, if not superior, to position-specific methods while using significantly fewer features, supporting the study's hypothesis and demonstrating its usefulness for rapid CPP screening in drug delivery research.
Srinandasai Ari (Sat,) studied this question.
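The feature construction and fold-wise oversampling described above can be illustrated with a minimal, standard-library-only sketch. It is not the study's implementation. The two per-residue properties (Kyte-Doolittle hydropathy and a simple formal charge) stand in for the 13 properties, which are not listed here. The function names, the toy sequences, and the choice to oversample until the classes are fully balanced are all assumptions made for illustration.

```python
import random
from statistics import mean, pstdev

# Illustrative per-residue values only (assumption: the study's 13 properties differ).
HYDROPATHY = {  # Kyte-Doolittle hydropathy index
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
    "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
    "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}  # all other residues: 0


def aggregate(seq):
    """Sum each property over the residues of one peptide."""
    return [sum(HYDROPATHY[a] for a in seq),
            sum(CHARGE.get(a, 0.0) for a in seq)]


def standardize(rows):
    """Z-score each feature column (population standard deviation)."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]


def stratified_folds(labels, k, rng):
    """Split indices into k folds while preserving class proportions."""
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    return folds


def oversample(train_idx, labels, rng):
    """Duplicate random minority-class indices until classes are balanced.

    Assumption: full balancing; the study's exact oversampling amount is not given.
    """
    by_cls = {}
    for i in train_idx:
        by_cls.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_cls.values())
    out = []
    for idx in by_cls.values():
        out.extend(idx)
        out.extend(rng.choices(idx, k=target - len(idx)))
    return out


# Toy demonstration with made-up sequences and labels (not data from the study).
peptides = ["KKRRKK", "AAGGLL", "DEDEGS", "RRWWRR"]
features = standardize([aggregate(p) for p in peptides])

rng = random.Random(0)
labels = [1] * 10 + [0] * 25  # 2.5:1 ratio of non-CPPs to CPPs
folds = stratified_folds(labels, k=5, rng=rng)
for held_out in range(len(folds)):
    train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    train_balanced = oversample(train, labels, rng)  # validation fold untouched
```

Oversampling after the split, and only on the training indices, keeps duplicated minority examples out of the validation fold, so the validation score is not inflated by copies of peptides the model has already seen.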