This paper is available on arXiv under a CC 4.0 license.
Authors:
(1) Jeffrey Ouyang-Zhang, UT Austin
(2) Daniel J. Diaz, UT Austin
(3) Adam R. Klivans, UT Austin
(4) Philipp Krähenbühl, UT Austin
Protein engineering is the process of mutating a natural protein to improve particular phenotypes, such as thermodynamic stability and function. The field has historically relied on rational design and stochastic methods, such as error-prone PCR [1], DNA shuffling [71], and directed evolution (DE) [4, 27], to identify gain-of-function mutations. Rational design is limited to proteins with solved structures and requires an extensive understanding of biochemistry, as well as specific knowledge of the particular protein, to select mutations that are highly likely to improve the target phenotypes [33, 79]. Directed evolution requires little to no knowledge of the protein and instead generates a library of protein variants that are then screened for the target phenotypes (e.g., fluorescence brightness, antibiotic resistance, stability, activity) [4, 27]. The library can be generated via site-saturation mutagenesis of a handful of positions in the sequence [12] or via DNA shuffling, in which the gene is digested into random fragments and reassembled into full-length sequences [71]. After screening the library, the most "fit" variant is selected as the starting sequence for the next round of directed evolution. This iterative process repeats until a protein variant with the desired phenotypes is obtained.
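As a rough illustration of this loop, the sketch below summarizes one greedy DE campaign in Python. The functions `generate_library` and `screen`, the variant counts, and the number of mutated sites are hypothetical placeholders for the wet-lab steps, not part of any cited protocol.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def generate_library(parent: str, num_variants: int = 96, num_sites: int = 3) -> list[str]:
    """Toy stand-in for library generation (e.g. site-saturation mutagenesis)."""
    library = []
    for _ in range(num_variants):
        variant = list(parent)
        for pos in random.sample(range(len(parent)), num_sites):
            variant[pos] = random.choice(AMINO_ACIDS)
        library.append("".join(variant))
    return library

def directed_evolution(parent: str, screen, num_rounds: int = 5) -> str:
    """Greedy DE: screen each library and carry the fittest variant into the next round."""
    best = parent
    for _ in range(num_rounds):
        library = generate_library(best)
        # `screen` maps a sequence to a measured phenotype (brightness, activity, stability, ...)
        best = max(library + [best], key=screen)
    return best
```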
Machine learning has demonstrated its ability to augment rational design and accelerate the stabilization of a variety of proteins [20, 28, 40, 52, 68]. Separately, machine learning-guided directed evolution (MLDE) [75] has been shown to improve the likelihood of reaching the global fitness maximum by 81-fold compared to traditional DE [76]. MLDE has accelerated the engineering of several protein properties, such as the enantioselectivity of enzymes for the kinetic resolution of epoxides [23] and the activity and expression of a glutathione transferase [46]. Mutate Everything empowers experimentalists to accelerate the stabilization of a protein in both rational design and MLDE workflows.
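A minimal MLDE-style sketch, assuming one-hot features over equal-length sequences and scikit-learn's RandomForestRegressor as a stand-in surrogate fitness model (the cited works use different models and encodings): the surrogate is fit on already-screened variants and used to rank untested candidates for the next round.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq: str) -> np.ndarray:
    """Flat one-hot encoding of a protein sequence (L x 20)."""
    x = np.zeros((len(seq), len(AMINO_ACIDS)))
    for i, aa in enumerate(seq):
        x[i, AMINO_ACIDS.index(aa)] = 1.0
    return x.ravel()

def rank_candidates(screened: dict[str, float], candidates: list[str]) -> list[str]:
    """Fit a surrogate fitness model on screened variants and rank untested ones."""
    X = np.stack([one_hot(s) for s in screened])
    y = np.array(list(screened.values()))
    model = RandomForestRegressor(n_estimators=200).fit(X, y)
    scores = model.predict(np.stack([one_hot(s) for s in candidates]))
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)]
```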
Recent advances in machine learning have led to remarkable progress in protein structure prediction. AlphaFold [31] has demonstrated that deep learning is highly effective at predicting a protein's structure from its sequence by exploiting evolutionary history via a multiple sequence alignment (MSA). AlphaFold passes the MSA and a pairwise representation of the sequence into the Evoformer to capture the co-evolutionary patterns between residues. The Evoformer output is then processed by the Structure Module, which predicts the protein's structure. We challenge prior works postulating that AlphaFold cannot be used for stability prediction and show that fine-tuning these rich co-evolutionary and structural features yields highly performant stability predictors [53].
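As an illustrative sketch only, and not the AlphaFold or OpenFold API, a stability head attached to per-residue Evoformer-style features could look like the following; the feature dimension, module names, and head architecture are assumptions.

```python
import torch
import torch.nn as nn

class DDGHead(nn.Module):
    """Toy regression head over per-residue features (e.g. an Evoformer single representation).

    Shapes are illustrative: `single` is (L, C) for a protein of length L.
    """
    def __init__(self, c_single: int = 384):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.LayerNorm(c_single),
            nn.Linear(c_single, 128),
            nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, single: torch.Tensor, position: int) -> torch.Tensor:
        # Regress a stability change from the representation of the mutated residue.
        return self.mlp(single[position])

# Usage with random features standing in for Evoformer output
head = DDGHead()
ddg = head(torch.randn(120, 384), position=42)
```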
Evolutionary Scale Modeling (ESM) [38, 62] has shown that protein structure prediction can be performed without MSAs or specialized architectures by leveraging large transformers pre-trained on masked token prediction. Other works build on this language-model pre-training paradigm, including MSA Transformer [60], which incorporates a sequence's MSA as input, and Tranception [49], which develops a hybrid convolution- and attention-based autoregressive architecture. We show that fine-tuning these evolutionary representations excels at protein stability assessment without MSAs.
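For intuition, here is a minimal sketch of extracting per-residue ESM-2 representations with the public `fair-esm` package; the example sequence is arbitrary, and the ∆∆G regression head mentioned in the comment is a placeholder, not our trained model.

```python
import torch
import esm  # facebookresearch/esm; pip install fair-esm

# Load a pre-trained ESM-2 model and its batch converter (tokenizer).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("wild_type", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
per_residue = out["representations"][33][0, 1:-1]  # drop BOS/EOS tokens -> (L, 1280)

# A small regression head over the mutated position's embedding would then be
# fine-tuned on labeled stability data; that part is a hypothetical sketch here.
```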
Traditional approaches to protein stability assessment relied on a combination of physics-based methods, statistical analysis, and classical machine learning techniques. Physics-based tools, such as FoldX, Rosetta, and SDM, use energy functions and statistical patterns to assess how mutations affect a protein's stability [32, 67, 77]. DDGun [44] directly infers ∆∆G from heuristics, including BLOSUM substitution scores, differences in interaction energy with neighboring residues, and changes in hydrophobicity. Many classical machine learning approaches used support vector machines and decision trees with physics-based feature engineering [11, 14, 15, 17, 37, 66]. Others ensemble existing machine learning methods [35, 56, 59, 63, 64].
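To make the heuristic flavor concrete, the toy estimate below combines a BLOSUM62 substitution score with a Kyte-Doolittle hydrophobicity change; the weights are arbitrary placeholders, and this is not DDGun's fitted model.

```python
from Bio.Align import substitution_matrices  # Biopython

BLOSUM62 = substitution_matrices.load("BLOSUM62")
# Kyte-Doolittle hydropathy scale
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5, "E": -3.5,
      "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8,
      "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2}

def heuristic_ddg(wt_aa: str, mut_aa: str, w_sub: float = -0.1, w_hyd: float = -0.05) -> float:
    """Toy unsupervised ∆∆G estimate from a substitution score and a hydrophobicity change.

    The weights are arbitrary placeholders, not coefficients from any published method.
    """
    substitution = BLOSUM62[wt_aa, mut_aa] - BLOSUM62[wt_aa, wt_aa]
    hydrophobicity = KD[mut_aa] - KD[wt_aa]
    return w_sub * substitution + w_hyd * hydrophobicity

print(heuristic_ddg("L", "D"))  # a buried-hydrophobic-to-charged swap scores as destabilizing
```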
More recently, deep learning-based approaches have begun to outperform physics-based and classical machine learning approaches [13, 19, 34]. ACDC-NN [6] trains an asymmetric convolutional neural network to predict forward and reverse mutation effects. ThermoNet [36] voxelizes both the wild-type and mutant protein structures and feeds them into a 3D CNN to regress a ∆∆G value. PROSTATA [74] feeds the wild-type and mutant protein sequences into a pre-trained ESM2 model and then regresses a ∆∆G value. Stability Oracle [18] takes the native and mutant amino acid types, along with the local structure, as input and regresses ∆∆G. Several deep learning frameworks [64, 80, 81] model multiple mutations in addition to single mutations. In this paper, we develop a framework that models the protein itself, enabling efficient enumeration of all candidate mutations.
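For illustration, once a model produces one predicted ∆∆G per position and target amino acid, enumerating and ranking every single-point mutation is straightforward; the score matrix below is random and merely stands in for such a model's output.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def enumerate_single_mutations(sequence: str, ddg_matrix: np.ndarray) -> list[tuple[str, float]]:
    """List every single-point mutation with its predicted ∆∆G.

    `ddg_matrix` has shape (L, 20): one predicted ∆∆G per position and target amino acid,
    e.g. decoded from a single pass of a sequence model (hypothetical stand-in here).
    """
    candidates = []
    for pos, wt in enumerate(sequence):
        for j, mut in enumerate(AMINO_ACIDS):
            if mut != wt:
                candidates.append((f"{wt}{pos + 1}{mut}", float(ddg_matrix[pos, j])))
    # Most stabilizing (lowest predicted ∆∆G) first
    return sorted(candidates, key=lambda item: item[1])

# Example with random scores for a short sequence
seq = "MKTAYIAK"
ranked = enumerate_single_mutations(seq, np.random.randn(len(seq), 20))
print(ranked[:5])
```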