What question did this study set out to answer?

The study focuses on enhancing B-cell epitope prediction accuracy for Influenza A using machine learning and biologically relevant negative samples.

May 7, 2026 · Open Access

The Impact of Biologically Relevant Negative Samples on Machine Learning-Based B-Cell Epitope Prediction for Influenza A

Key Points

Developed a machine learning framework utilizing physicochemical descriptors for prediction.
Constructed a curated dataset including validated epitopes and non-epitopes.
Evaluated five supervised classifiers to assess the impact of negative sample quality.
Performed feature selection using Analysis of Variance and Mutual Information.
The Random Forest model achieved 82% accuracy and 83% F1-score.
A Matthews correlation coefficient of 0.65 and an area under the receiver operating characteristic curve of 0.90 were observed.
Demonstrated improved performance compared to existing tools on the curated dataset.
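
The metrics reported in the key points (accuracy, F1-score, Matthews correlation coefficient, and area under the ROC curve) can be computed as in the minimal sketch below. It is an illustration of the standard definitions only, not the authors' evaluation code: the function names `binary_metrics` and `roc_auc`, the confusion-matrix counts, and the scores are all hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, F1-score and MCC from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    # MCC uses all four cells, so it stays informative on imbalanced data.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


def roc_auc(scores_pos, scores_neg):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos
        for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical counts and scores, chosen only to make the arithmetic easy to follow.
example = binary_metrics(tp=40, fp=10, fn=10, tn=40)
example_auc = roc_auc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

With these toy inputs, accuracy and F1 are both 0.8 and the MCC is 0.6, which shows how the MCC can sit well below accuracy on the same predictions.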

Abstract

Motivation: Predicting linear B-cell epitopes remains a major challenge in immunoinformatics, particularly for rapidly evolving viruses such as Influenza A. Many existing predictors rely on heterogeneous training datasets, poorly defined negative samples, or low-interpretability models, which can limit performance on pathogen-specific tasks. Improving prediction therefore requires biologically meaningful datasets together with informative and interpretable sequence representations.

Results: We developed a machine learning framework based on sequence-derived physicochemical descriptors for linear B-cell epitope prediction in Influenza A. A curated dataset of experimentally validated epitopes and non-epitopes was constructed using redundancy reduction and balanced sampling strategies. Five supervised classifiers were evaluated, and the effect of real versus artificial negative datasets was systematically assessed. Feature selection using Analysis of Variance and Mutual Information showed that predictive performance emerged from the combined contribution of multiple descriptors rather than single variables. The best model, Random Forest, achieved 82% accuracy, 83% F1-score, Matthews correlation coefficient of 0.65, and area under the receiver operating characteristic curve of 0.90. Benchmarking against widely used tools showed improved balanced performance on our curated Influenza A dataset.

Availability and implementation: Source code, processed datasets, and reproducible analysis scripts are freely available at GitHub: https://github.com/cparejabarrueto/epitopes
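
The abstract does not list which physicochemical descriptors were used, so the sketch below is only a generic illustration of how a peptide sequence can be turned into a small numeric feature vector. It assumes the Kyte-Doolittle hydropathy scale and a crude net-charge count (K and R as +1, D and E as -1, ignoring histidine and the termini). The function name `describe_peptide` and the example sequences are hypothetical and are not drawn from the study's dataset.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Simplified side-chain charges (assumption: histidine and termini ignored).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def describe_peptide(seq):
    """Return a few illustrative sequence-derived descriptors for one peptide."""
    seq = seq.upper()
    unknown = set(seq) - set(KYTE_DOOLITTLE)
    if not seq or unknown:
        raise ValueError(f"empty sequence or non-standard residues: {sorted(unknown)}")
    n = len(seq)
    return {
        "length": n,
        "mean_hydropathy": sum(KYTE_DOOLITTLE[a] for a in seq) / n,
        "net_charge": sum(CHARGE.get(a, 0) for a in seq),
        "aromatic_fraction": sum(a in "FWY" for a in seq) / n,
    }


# Hypothetical toy peptides, used only to exercise the function.
features_a = describe_peptide("AAAA")
features_b = describe_peptide("KDEF")
```

Vectors of this kind, one per peptide, are what a classifier such as Random Forest would take as input. Feature selection would then rank the individual descriptor columns.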

Cite This Study

Pareja-Barrueto et al. studied this question.

synapsesocial.com/papers/69fbefef164b5133a91a40ef https://doi.org/10.1093/bioadv/vbag127