What question did this study set out to answer?

This study set out to develop a machine learning framework that predicts blood–brain barrier permeability from chemical structure data.

June 5, 2026 · Open Access

Predicting Blood–Brain Barrier Permeability from Experimental Data: An Interpretable and Externally Validated Machine Learning Framework

Key Points

The framework was built on the B3DB experimental database: 7807 chemicals with BBB+/BBB− annotations and 1058 compounds with in vivo log BB values.
Forty two-dimensional chemical descriptors were calculated from SMILES notation with the Mordred library, without artificial data augmentation.
Nine machine learning methods were benchmarked with stratified five-fold cross-validation.
Gradient boosting gave the best regression performance on the held-out test set (n = 212): R² = 0.6043, RMSE = 0.4740, and MAE = 0.3326 log units.
On the internal test set (n = 1562), the corresponding classifier achieved an AUC-ROC of 0.9476 and a balanced accuracy of 0.8568.
On the independent external validation set (n = 175), the classifier achieved an AUC-ROC of 0.9137, compared with 0.9476 on the internal test set.
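
For readers less familiar with the metrics quoted above, the following is a minimal sketch of how R², RMSE, MAE, balanced accuracy, and AUC-ROC are conventionally computed. It is not the authors' code: the function names and the toy values are hypothetical, labels are assumed to be coded 1 for BBB+ and 0 for BBB−, and the 0.5 decision threshold is an assumption made only for illustration.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R^2, RMSE and MAE for paired observed/predicted values."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n,
    }


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def auc_roc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy values (hypothetical, not taken from the study).
log_bb_observed = [0.0, 1.0, 2.0, 3.0]
log_bb_predicted = [0.5, 1.0, 2.0, 2.5]
print(regression_metrics(log_bb_observed, log_bb_predicted))

labels = [1, 1, 1, 0, 0]              # 1 = BBB+, 0 = BBB- (assumed coding)
scores = [0.9, 0.8, 0.4, 0.5, 0.1]    # predicted probability of BBB+
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold
print(balanced_accuracy(labels, predictions), auc_roc(labels, scores))
```

Balanced accuracy averages the per-class recall, so it is not inflated when one class dominates the data, while AUC-ROC is threshold-free and depends only on how the scores rank the compounds.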

Abstract

Background: The blood–brain barrier (BBB) restricts the brain penetration of most small molecules and almost all biologics, and it remains a significant hurdle in the development of drugs for the central nervous system (CNS). During early-stage screening, reliable computational prediction of BBB permeability, typically expressed as log BB, can help reduce the experimental load.

Methods: We present a validated machine learning framework built solely on the B3DB experimental database, which includes 7807 chemicals with BBB+/BBB− annotations and 1058 compounds with in vivo log BB values. A carefully selected set of 40 two-dimensional chemical descriptors was calculated from SMILES notation with the Mordred library, without artificial data augmentation. Nine machine learning methods were benchmarked using stratified five-fold cross-validation.

Results: On a held-out test set (n = 212), gradient boosting produced the best regression performance, with R² = 0.6043, RMSE = 0.4740, and MAE = 0.3326 log units, in line with the upper range reported for experimental BBB datasets. On an internal test set (n = 1562), the corresponding classifier obtained an AUC-ROC of 0.9476 and a balanced accuracy of 0.8568; on an independent external validation set (n = 175), it achieved an AUC-ROC of 0.9137. SHAP analysis identified topological polar surface area as the primary factor influencing BBB permeability, followed by lipophilicity and then ionization-related characteristics. Partial dependence analysis confirmed nonlinear relationships consistent with accepted pharmacokinetic principles.

Conclusion: This study provides an interpretable, externally validated technique for predicting BBB permeability in CNS drug discovery.
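
The stratified five-fold cross-validation named in the Methods can be illustrated with a short sketch. This is not the authors' implementation, and the abstract does not say which library they used; the function name `stratified_kfold`, the seed, and the toy label counts are hypothetical. In practice a library routine such as scikit-learn's `StratifiedKFold` would normally be used in place of hand-written splitting.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs whose class ratios match the full dataset."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    # Deal each class's shuffled indices round-robin into the folds,
    # so every fold receives a near-equal share of every class.
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Toy labels (hypothetical): 30 BBB+ compounds coded 1, 20 BBB- compounds coded 0.
toy_labels = [1] * 30 + [0] * 20
for fold_number, (train_idx, test_idx) in enumerate(stratified_kfold(toy_labels), start=1):
    n_positive = sum(toy_labels[i] for i in test_idx)
    print(fold_number, len(train_idx), len(test_idx), n_positive)
```

Stratifying keeps the BBB+/BBB− ratio roughly constant across folds, so per-fold scores are comparable between methods. For the continuous log BB target, stratification would require binning the values first, which this sketch does not attempt.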
