Predicting blood–brain barrier (BBB)-penetrating peptides remains critical for peptide-based central nervous system drug delivery, yet model performance depends strongly on data curation and feature representation. In this study, we constructed a benchmark dataset from publicly available resources by merging peptide records and removing duplicate sequences, resulting in 426 positive and 6,865 negative samples. Each peptide was encoded using fused representations that combine protein language model embeddings with physicochemical descriptors, yielding a 2,121-dimensional feature space. After variance filtering, standardization, and mutual-information-based feature selection, the top 700 features were retained for classification. To address class imbalance, the majority class in the training set was randomly undersampled to achieve a 1:5 positive-to-negative ratio. A foundation model for tabular classification, termed B3BPFN, was then trained on the processed feature matrix and evaluated on an independent balanced test set comprising 20% of the positive samples and an equal number of negative samples. The final model achieved a sensitivity of 0.9294, specificity of 0.8824, accuracy of 0.9059, Matthews correlation coefficient (MCC) of 0.8127, and area under the receiver operating characteristic curve (AUROC) of 0.9460. SHAP analysis further revealed that composition–transition–distribution (CTDD) descriptors serve as important features for BBB-penetrating peptide prediction. A user-friendly web server is freely available at https://ycclab.cuhk.edu.cn/b3bpfn to facilitate community use.
Liu et al. studied this question.