What question did this study set out to answer?

The research aims to develop a machine-learning framework for predicting anticancer small molecules with enhanced accuracy and interpretability.

April 19, 2026 | Open Access

XGBPred-ACSM: A Hybrid Descriptor-Driven XGBoost Framework for Anticancer Small Molecule Prediction

Key Points

Processed 3600 compounds with IC50 values for modeling.
Derived molecular representations using 2D descriptors and structural fingerprints.
Trained six different machine learning algorithms for comparative analysis.
Evaluated models with 10-fold cross-validation and multiple performance metrics.
Examined ensemble voting strategies to enhance prediction accuracy.
XGB-Hybrid architecture achieved an AUC of 0.88 and accuracy of 79.11% on the independent test set.
Feature contributions analyzed using SHAP, revealing key molecular determinants for anticancer activity.
Established the XGB-Hybrid framework as high-performance and interpretable for drug discovery.
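
The 10-fold cross-validation mentioned in the key points can be illustrated with a minimal, standard-library sketch of a stratified split. This is not the authors' code: the function name `stratified_k_fold`, the toy labels, and the choice of stratification are assumptions made here for illustration (the source states only that 10-fold cross-validation was used).

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs, keeping class proportions
    roughly equal across the k folds."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Shuffle each class, then deal its indices round-robin into the folds
    # so every fold receives a near-equal share of that class.
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the held-out set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


# Hypothetical toy labels (1 = active, 0 = inactive); not the study's data.
labels = [1] * 60 + [0] * 40
splits = list(stratified_k_fold(labels, k=10))
```

In practice, a library routine such as scikit-learn's `StratifiedKFold` would normally replace a hand-written splitter; the sketch only shows the mechanics of holding out each fold once.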

Abstract

Background/Objectives: Cancer remains one of the leading global health burdens, mainly because of the lack of specificity and off-target toxicity associated with conventional therapeutic approaches. To move toward more efficient anticancer drug discovery, we developed a machine-learning-based architecture for predictive modeling of anticancer small molecules.

Methods: A total of 3600 compounds with experimentally validated IC50 values were systematically processed to derive a suite of molecular representations comprising 2D physicochemical descriptors, structural fingerprints, and hybrid descriptor sets generated via the Mordred and PaDEL frameworks. Six machine learning algorithms, namely Random Forest (RF), Extreme Gradient Boosting (XGB), Gradient Boosting (GB), Extra-Trees classifier (ET), Adaptive Boosting (AdaBoost), and Light Gradient Boosting Machine (LightGBM), were trained and benchmarked under a model evaluation protocol incorporating 10-fold cross-validation along with multiple performance metrics. Ensemble voting strategies were also examined to assess potential performance gains.

Results: Of all configurations, the XGB-Hybrid architecture emerged as the most robust and generalizable classifier, with an AUC of 0.88 and an accuracy of 79.11% on the independent test set. For interpretability and mechanistic insight, SHAP-based feature analysis was conducted to quantify feature contributions and reveal the molecular determinants most influential for discriminating anticancer activity. Altogether, the current study establishes the XGB-Hybrid framework as a technically rigorous, interpretable, and high-performance predictive modeling approach with the ability to accelerate early-stage identification of anticancer small molecules.

Conclusions: The study highlights the transformational effect of machine learning in modern computational oncology and rational drug design pipelines.
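
To make the two reported metrics concrete, the following minimal sketch computes accuracy and AUC from scratch. The labels, scores, and the 0.5 decision threshold are hypothetical toy values chosen for illustration; they are not the study's data or settings, and the helper names `accuracy` and `roc_auc` are introduced here only for this example.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a randomly chosen positive receives a
    higher score than a randomly chosen negative (ties count as 0.5)."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical toy example: 1 = active, 0 = inactive.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]

# Assumed decision threshold of 0.5 for turning scores into class labels.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

auc_value = roc_auc(y_true, scores)   # 8 of 9 positive/negative pairs ranked correctly
acc_value = accuracy(y_true, y_pred)  # 4 of 6 labels predicted correctly
```

The example shows why the two numbers can differ: AUC depends only on how the scores rank actives above inactives, whereas accuracy depends on the chosen threshold. The pairwise loop is quadratic and intended for explanation only; library functions such as scikit-learn's `roc_auc_score` would typically be used on real data.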
