Computational protein design has become a critical area of research in recent years. With the advent of deep learning, several foundation models have emerged that give researchers unprecedented insight into the relationship between sequence and function. As more advanced neural networks are developed, we must examine how these tools can be leveraged to engineer novel and effective design protocols. Furthermore, a thorough examination of how existing architectures can be applied to unexplored problems in protein science is necessary to harness the full potential of machine learning. Here, we present our efforts to contribute to the development of computational protein design tools. In chapter one, we develop an HPC-enabled dynamic pipeline of pre-trained foundation models. This framework facilitates the iterative cycling and gradual optimization of proteins as they converge on high-quality therapeutic binders. Our method achieves measurably improved design quality over baseline techniques and showcases how dynamic compute resource allocation can improve the efficiency of functional landscape traversal. In chapter two, we use these large models to examine the allowable sequence diversity along one side of a viral protein interface. We sample functional variants at unprecedented mutational depth and use these data to train a structure-aware graph classifier. Our model generalizes well to out-of-distribution data, allowing for distal variant effect forecasting. Chapter three focuses on the development of a sequence-based contrastive learning network that learns a joint latent space indicating whether a given PDZ-peptide pair will bind. This framework performs strongly, generalizes well, and is readily transferable to a diverse suite of alternate protein binding systems.
Across all datasets, our method achieves performance comparable to state-of-the-art prediction techniques. We further test the contrastive model by applying it to both designed domains and peptides for rapid candidate screening. Binders scored favorably by the network show significant experimental activity against their targets, further corroborating its predictive power. Finally, in chapter four, we highlight how structure-based graphs allow for a similar recapitulation of protein binding data. Here, we present our full structure prediction pipeline, which generates high-quality geometric data for use in downstream tasks. While our graph model exhibits robust performance on held-out validation sets, it still falls short of the contrastive framework’s generalizability. We then propose a number of future avenues that would more fully harness the information-rich structural representations of protein binder samples to improve prediction.
Jonathan Evan Ash (Thu,) studied this question.