Predicting protein variant effects is a key challenge in preparing for pathogenic viral strains, understanding mutation-linked diseases, and designing new proteins. Protein sequence-structure-function relationships are difficult to model due to complex allosteric and epistatic effects. To investigate efficient modeling strategies, we trained supervised machine learning (ML) models on deep mutational scanning (DMS) libraries of SARS-CoV-2 receptor-binding domain (RBD) sequences labeled with angiotensin-converting enzyme 2 (ACE2) binding affinity. These models predict combinatorial mutation effects better than adding or averaging the effects of point mutations, and they show strong extrapolative performance in ranking Omicron variants when trained only on variants near the wild type (WT). We characterize the RBD fitness landscape by combining ML with Markov chain Monte Carlo simulations to predict evolutionary patterns starting from the WT sequence. These simulations generate sequence profiles comparable to those of high-fitness sequences in the DMS data and predict mutations found in unseen Omicron variants. The models provide insight into the relationship between RBD sequence elements and offer a new perspective on the use of DMS to predict emerging viral strains, an approach we anticipate will be applicable to other evolutionary prediction tasks. To facilitate application and future development of this strategy, we introduce Mavenets: https://github.com/SztainLab/mavenets.
Durumeric et al. studied this question.
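The idea of pairing a fitness predictor with Markov chain Monte Carlo, as described in the abstract, can be illustrated with a minimal Metropolis sketch in Python. This is not the authors' implementation and does not use the Mavenets API. The names `toy_fitness` and `metropolis_walk`, the temperature value, and the short toy sequences are all assumptions made for illustration; in practice the fitness function would be a trained model that predicts ACE2 binding affinity from an RBD sequence.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_fitness(seq, target):
    """Hypothetical stand-in for a trained model.

    Returns the fraction of positions at which seq matches target.
    A real application would call a learned binding-affinity predictor.
    """
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def metropolis_walk(start, fitness, n_steps=1000, temperature=0.05, seed=0):
    """Metropolis sampling over sequences using single-site substitutions.

    Each step proposes one point mutation (uniform position, uniform choice
    among the 19 other residues, so the proposal is symmetric). Uphill moves
    are always accepted; downhill moves are accepted with probability
    exp(delta / temperature).
    """
    rng = random.Random(seed)
    current = list(start)
    current_score = fitness("".join(current))
    trajectory = [("".join(current), current_score)]

    for _ in range(n_steps):
        pos = rng.randrange(len(current))
        candidates = AMINO_ACIDS.replace(current[pos], "")
        proposal = current.copy()
        proposal[pos] = rng.choice(candidates)
        proposal_score = fitness("".join(proposal))

        delta = proposal_score - current_score
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = proposal, proposal_score

        trajectory.append(("".join(current), current_score))

    return trajectory


# Toy usage: both sequences are arbitrary placeholders, not real RBD data.
start_seq = "AAAAAAAA"
target_seq = "ACDEFGHI"
walk = metropolis_walk(start_seq, lambda s: toy_fitness(s, target_seq), n_steps=500)
final_seq, final_score = walk[-1]
```

Counting residue frequencies per position across many such trajectories would give a sequence profile that could then be compared against high-fitness sequences in experimental data. The temperature controls how readily the walk accepts fitness-lowering mutations and would need to be tuned for any real predictor.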