Abstract

Motivation
Predicting linear B-cell epitopes remains a major challenge in immunoinformatics, particularly for rapidly evolving viruses such as Influenza A. Many existing predictors rely on heterogeneous training datasets, poorly defined negative samples, or models with low interpretability, all of which can limit performance on pathogen-specific tasks. Improving prediction therefore requires biologically meaningful datasets together with informative and interpretable sequence representations.

Results
We developed a machine learning framework based on sequence-derived physicochemical descriptors for linear B-cell epitope prediction in Influenza A. A curated dataset of experimentally validated epitopes and non-epitopes was constructed using redundancy reduction and balanced sampling strategies. Five supervised classifiers were evaluated, and the effect of real versus artificial negative datasets was systematically assessed. Feature selection using analysis of variance (ANOVA) and mutual information showed that predictive performance emerged from the combined contribution of multiple descriptors rather than from single variables. The best model, Random Forest, achieved an accuracy of 82%, an F1-score of 83%, a Matthews correlation coefficient of 0.65, and an area under the receiver operating characteristic curve of 0.90. Benchmarking against widely used tools showed improved balanced performance on our curated Influenza A dataset.

Availability and implementation
Source code, processed datasets, and reproducible analysis scripts are freely available on GitHub: https://github.com/cparejabarrueto/epitopes
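As a reading aid for the metrics reported above, the following minimal sketch shows how accuracy, F1-score, and the Matthews correlation coefficient (MCC) are derived from a binary confusion matrix. It is not taken from the authors' repository: the function name `binary_metrics` and the confusion-matrix counts are hypothetical and chosen only so the arithmetic is easy to follow. They are not the study's data.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Return accuracy, F1-score, and MCC for a binary confusion matrix.

    tp/fn count epitopes predicted correctly/incorrectly;
    tn/fp count non-epitopes predicted correctly/incorrectly.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0

    # F1 is the harmonic mean of precision and recall,
    # which simplifies to 2TP / (2TP + FP + FN).
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    # MCC uses all four cells, so it stays informative when classes
    # are imbalanced. By convention it is set to 0 when any marginal is 0.
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0

    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Hypothetical counts for illustration only (not results from the study).
metrics = binary_metrics(tp=40, fp=10, tn=40, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
# accuracy: 0.80
# f1: 0.80
# mcc: 0.60
```

With these illustrative counts, accuracy and F1 are both 0.80 while the MCC is 0.60. The MCC ranges from -1 to 1 rather than from 0 to 1, which is why it typically sits below accuracy and F1 for the same predictions.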
Pareja-Barrueto et al. (Fri,) studied this question.