Anticancer peptides (ACPs) are short chains of amino acids that can selectively destroy cancer cells while limiting harm to healthy tissue. However, identifying ACPs experimentally is costly and time-consuming, which motivates computational approaches. In this study, several machine learning pipelines were developed and evaluated for predicting anticancer peptide function from sequence-derived features. A curated dataset of 6,785 labeled peptide sequences was analyzed in Orange Data Mining using three feature representations: amino acid composition, aggregated physicochemical properties, and k-mer bag-of-words encoding of the sequence. The classification models implemented included logistic regression, support vector machines, decision trees, random forests, gradient boosting, and neural networks. Model performance was assessed with stratified five-fold cross-validation, using metrics including accuracy, area under the receiver operating characteristic curve (AUC), precision, recall, F1-score, and Matthews correlation coefficient (MCC). Among the three pipelines, physicochemical property-based models performed best, with the random forest classifier reaching approximately 89% accuracy and an AUC of 0.96. The k-mer and amino acid composition models were also predictive, with accuracy just under 80%, but did not outperform the physicochemical representation. These findings indicate that biologically informed features substantially improve anticancer peptide prediction and illustrate how machine learning pipelines can help accelerate peptide-based drug discovery.
Kiaan Saraiya (Wed,) studied this question.
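
To make two of the feature representations and one of the metrics named above concrete, the following is a minimal sketch in plain Python. It is not the study's workflow, which was built in Orange Data Mining. The function names, the choice of k = 2, and the toy sequence are assumptions made for illustration only.

```python
from collections import Counter
from math import sqrt

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Return the fraction of each standard amino acid in a peptide."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def kmer_counts(seq, k=2):
    """Return bag-of-words counts of overlapping k-mers (k=2 is an assumed default)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    peptide = "KLAKLAK"  # toy string for illustration, not taken from the dataset
    print(amino_acid_composition(peptide)["K"])
    print(kmer_counts(peptide, k=2))
    print(mcc(tp=1, tn=1, fp=0, fn=0))
```

Each peptide becomes a fixed-length vector under amino acid composition (20 values) and a sparse count vector under k-mer encoding, which is what allows standard classifiers such as random forests to be trained on variable-length sequences.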