Hacking in silico protein engineering.


Kamil Tamiola (@KamilTamiola)

An artist's impression of PaaA2 protein structural ensembles with alpha-helices colored in blue. The PaaA2 protein ensemble data is available in the dSPP database at https://peptone.io/dspp/entry/dSPP18841_0


It is safe to say proteins are building blocks and the machinery which defines living matter. In the last 70 years tremendous progress has been made in their isolation, production, characterization, and finally engineering. Although great advancements in laboratory and industrial-scale protein production have been made, protein engineering and all the associated steps remain laborious, expensive and truly complicated.

If you prefer to skip further reading and start hacking straight away, simply visit our GitHub repository.

Proteins are polymers

Proteins are complex biomolecules made of 20 building blocks, the amino acids, which are connected sequentially into long non-branching chains, commonly known as polypeptide chains.

Unique spatial arrangement of polypeptide chains yields 3D molecular structures, which define protein function and interactions with other biomolecules.

Although the very basic forces that govern protein 3D structure formation are known and understood, the exact nature of polypeptide folding remains elusive and has been studied extensively for the past 50 years.

Protein engineering is complex

We want to engineer proteins to enhance their properties, typically their stability under different temperatures, pH or salinity. Frequently, researchers aim at improving the catalytic performance of enzymes, or at adding completely new types of chemical activity to known proteins.

The most common and established way to engineer a protein is to create variants with substituted amino acids, also known as mutants. The newly produced mutants are then characterized using various experimental techniques to measure the degree of enhancement, e.g. scanning calorimetry, isoelectric point determination, simple solubility studies or advanced enzymatic activity assays. However, since there are 20 standard protein amino acids, a complete mutagenesis of a 100-residue-long polypeptide would yield 20¹⁰⁰ mutant combinations, should you decide to explore every possible combination of the standard amino acids.

Exploring all possible mutants of a 100 amino acid protein (polypeptide) would take roughly 10¹¹² times longer than the age of the known universe, even if you could express, purify and characterize each variant in 1 second: 20¹⁰⁰ is about 1.3 × 10¹³⁰ variants, whereas only about 4.4 × 10¹⁷ seconds have elapsed since the Big Bang.
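The scale of this search space is easy to check with a few lines of Python. The sketch below is only a back-of-the-envelope estimate: the age of the universe (about 13.8 billion years) is an assumption not stated in the text, and the one-second throughput per variant is the optimistic figure used above.

```python
import math

N_AMINO_ACIDS = 20        # standard protein amino acids
CHAIN_LENGTH = 100        # residues in the polypeptide
SECONDS_PER_VARIANT = 1   # optimistic throughput from the text

# Assumption: the universe is about 13.8 billion years old.
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600

# Python integers have arbitrary precision, so 20**100 is exact.
n_variants = N_AMINO_ACIDS ** CHAIN_LENGTH

# Work in log10 so the magnitudes stay readable.
log10_variants = math.log10(n_variants)
log10_ratio = (
    log10_variants
    + math.log10(SECONDS_PER_VARIANT)
    - math.log10(AGE_OF_UNIVERSE_S)
)

print(f"number of variants: about 10^{log10_variants:.1f}")
print(f"ages of the universe needed: about 10^{log10_ratio:.1f}")
```

Under these assumptions the number of variants is on the order of 10¹³⁰, and screening them all would take on the order of 10¹¹² ages of the universe.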

Quite likely, only a marginal fraction of the mutants would have the desired properties: usually, the more you change a protein, the further you step away from its original function. This is not a strict rule, as the effect is protein-specific, but replacing a major part of a protein with a completely new amino acid sequence will likely produce a new fold, and hence new functionality. I have also intentionally left out a fundamentally important fact: mutations may significantly affect protein dynamics, and thus protein function.

Protein biotechnology is to a large extent hampered by scale and complexity of mutational analysis.

How can Machine Learning accelerate progress in protein science?

The most advanced and probabilistic (Bayesian) variants of Machine Learning depend heavily on the size and quality of input data. This argument is especially important for inference and prediction techniques in life sciences, where the levels of model complexity are perplexing or simply unknown.

Protein structure, function and dynamics predictions through Machine Learning methodology are no exception. However, even with protein databases that are relatively sparse (compared to the number of possible combinations of all protein amino acids in lengthy polypeptide chains), Machine Learning can help to unravel complex, non-linear relationships between protein sequences and their structural variability and dynamics. These relationships are either very difficult to model or simply not fully understood, so the predictions are coarse.

The biggest value of Machine Learning methods in the prediction of biophysical properties of proteins is their ability to “equate” loosely related protein features to measurable experimental data. Predictions from the complex numerical models that underlie Machine Learning methodology can thus be further tweaked and refined by providing independent experimental proxies of protein structure and dynamics.

A simplified connectivity diagram of Hybrid Machine Learning methodology developed at Peptone — The Protein Intelligence Company.

Proteins are dynamic and exhibit variable degrees of disorder

Just like every other molecule present in our natural environment, polypeptide chains undergo molecular motions at time scales ranging from nanoseconds to minutes.

It is accepted that complete understanding of protein functions and activity requires knowledge of structures and dynamics.

Structural disorder is a very peculiar property of many known and characterised proteins. It has been attributed to specific patterns in protein sequence, and it has immediate consequences for protein stability, susceptibility to enzymatic digestion inside living cells and protein-protein interactions. In turn, it plays a decisive role in many debilitating human pathologies.

From an industrial biotechnology point of view, the ability to accurately discern disordering effects of amino acid mutations in engineered proteins can save vast amounts of time and resources. An accurate disorder prediction for an arbitrary protein mutant can immediately report on problematic combinations of amino acid sequences, thus excluding the residues from further mutational analysis and vastly reducing the mutation search space.

A single protein structure model is not enough

Under conditions of living organisms (aka native conditions) in an aqueous environment, the state of an arbitrary polypeptide can be thought of as an ensemble of structures, which at any given moment in time have slightly different conformations, as a consequence of protein dynamics and intrinsic “flexibility” or simply disorder.

MOAG-4 ensemble model has been kindly provided by Dr. Frans A.A. Mulder (Aarhus University, DK) and Dr. Predrag Kukic (University of Cambridge, UK). Please read “MOAG-4 Promotes the Aggregation of α-Synuclein by Competing with Self-Protective Electrostatic Interactions” to learn more about this protein and its medical relevance.

The image above shows the superposition of models belonging to a structural ensemble of the MOAG-4 protein, which controls aggregation of proteins implicated in Parkinson’s disease. You can infer from this model that MOAG-4 has a relatively stable (well-defined) alpha-helical structure, colored in grey, and a highly disordered tail, depicted by the floating polypeptide chains of individual ensemble members.

Structural disorder

MOAG-4 (dSPP27058_0 in our database) is a medically relevant example of a protein that exhibits a high degree of intrinsic structural disorder.

The Alpha-synuclein ensemble has been adopted from “Structural Ensembles of Membrane-bound α-Synuclein Reveal the Molecular Determinants of Synaptic Vesicle Affinity”.

Alpha-synuclein, pictured above, is a seminal example of a completely disordered protein. Although the ensemble of Alpha-synuclein is heterogeneous, this protein plays an important role in neurotransmitter mediation in the human brain, and has been implicated as the key player in Parkinson’s disease development.

Putting protein order and disorder together

Among the multitude of advanced experimental protein techniques, NMR spectroscopy offers exquisite sensitivity to structural detail and dynamics at a single residue level. We have used NMR resonance assignment data from 7200+ proteins stored in public repositories and computed sequence-specific propensity scores.

The ensemble behavior of partially disordered MOAG-4 can be characterized and “compressed” (with few critical assumptions discussed in our paper) to a structural propensity vector.

Importantly, our method excels at capturing residual intrinsic disorder, as seen in the example of intrinsically disordered Alpha-synuclein.

Getting started with the data for Machine Learning

In order to facilitate open source development of Machine Learning methodologies for disorder and order prediction, we have prepared our dSPP data in Keras-, TensorFlow- and Edward-friendly forms. Simply check our GitHub repository.

To install the dspp-keras integration, just type the following in your Terminal,

pip install dspp-keras

How to use dspp-keras to train models?

To load the dataset in your Python models, use the dsppkeras.datasets module.

from dsppkeras.datasets import dspp
X, Y = dspp.load_data()

How to encode protein sequence for Machine Learning?

One of the most important questions when designing a Machine Learning training and inference procedure is how to represent the input and output data. The amino acid sequence of an arbitrary protein can be represented as a one-hot encoded vector, with every amino acid assigned a unique binary representation. As an example, let’s consider the amino acid sequence of the NS2B polypeptide from Zika virus,


The one-hot vector, which represents the NS2B Zika Virus sequence can be written as,

np.array([0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,
0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0], dtype=np.uint8)
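A flattened one-hot vector of this kind can be built with a short helper. The sketch below uses plain Python lists and assumes the 20 standard amino acids are ordered alphabetically by their one-letter codes. That ordering, the function name `one_hot_encode` and the tripeptide used as input are illustrative assumptions, so check the dspp-keras repository for the ordering actually used in the dataset.

```python
# Assumed ordering: one-letter codes of the 20 standard amino acids, A to Y.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return a flat list of 0/1 values, 20 entries per residue."""
    encoded = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        # Raises KeyError for non-standard residues such as 'X' or 'B'.
        row[INDEX[residue]] = 1
        encoded.extend(row)
    return encoded


# Hypothetical tripeptide, used only to illustrate the layout.
example = one_hot_encode("MKV")
print(len(example), sum(example))
```

The resulting list can be wrapped as `np.array(example, dtype=np.uint8)` to obtain the same kind of array as shown above.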

The output. Structural propensity data.

Individual, residue-specific scores are bounded between 1.0 and 3.0. A propensity of 1.0 indicates sampling of beta-sheet conformations. A score of 2.0 indicates behaviour found in disordered proteins, whereas 3.0 indicates a properly folded alpha-helix. A score of 2.5 should be understood as a situation in which 50% of ensemble members form a helix and the remainder sample different conformations.
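One way to read these scores programmatically is sketched below. It assumes the scale is linear between the anchor points, which is an extrapolation from the 2.5 example above rather than something stated explicitly, and it treats 0.0 as a missing value. The function name and labels are illustrative.

```python
def interpret_propensity(score):
    """Map a propensity score to (label, fraction of the ensemble).

    Assumes linear interpolation between 1.0 (beta-sheet),
    2.0 (disordered) and 3.0 (alpha-helix); 0.0 marks missing data.
    """
    if score == 0.0:
        return ("missing", None)
    if not 1.0 <= score <= 3.0:
        raise ValueError(f"score {score} is outside the range 1.0 to 3.0")
    if score > 2.0:
        return ("alpha-helix", score - 2.0)
    if score < 2.0:
        return ("beta-sheet", 2.0 - score)
    return ("disordered", 0.0)


for s in (1.0, 2.0, 2.5, 3.0, 0.0):
    print(s, interpret_propensity(s))
```

Under this reading, a score of 2.5 maps to an alpha-helix fraction of 0.5, matching the 50% description in the text.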

The propensity score vector for NS2B polypeptide from Zika Virus is given by,

np.array([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.82, 1.70, 1.56, 1.43, 1.28, 1.59, 1.58, 1.77, 1.75, 2.06, 1.70, 1.95, 1.84, 2.17, 2.10, 2.16, 2.14, 2.21, 2.06, 2.06, 2.09, 2.07, 0.0, 0.0, 0.0, 2.02, 1.95, 1.96, 1.94, 1.97, 2.01, 2.06, 2.10, 2.14, 2.17, 2.13, 2.12, 2.05, 2.04, 2.03, 0.00, 2.09, 2.14, 2.16, 0.0, 2.16, 2.16, 2.16, 2.17], dtype=np.float32)

Note: 0.0 denotes missing experimental assignment data.
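Because 0.0 marks residues without experimental data rather than a real propensity, those positions should normally be excluded when a model is scored against the data. The sketch below shows one possible approach, a mean squared error that skips missing targets. The function name and the choice of loss are assumptions for illustration, not part of dspp-keras.

```python
def masked_mse(y_true, y_pred, missing=0.0):
    """Mean squared error over positions where the target is not missing."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if t != missing]
    if not pairs:
        # No experimental data at all: nothing to penalise.
        return 0.0
    return sum((t - p) ** 2 for t, p in pairs) / len(pairs)


# Hypothetical targets and predictions; the first residue has no assignment.
targets = [0.0, 2.0, 3.0]
predictions = [5.0, 2.0, 2.0]
print(masked_mse(targets, predictions))
```

In this example the first position is ignored even though the prediction there is far off, so only the two measured residues contribute to the error.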

Making it all interactive

For the biology-oriented audience and curious computational scientists, we’ve created a web service at https://peptone.io/dspp.

Use it to find proteins, explore their propensities and preview all the data in a machine learning-ready form.

What’s next?

We have prepared a boilerplate model to help you start development.

It is just a single Dense layer network. However, you should be able to add more layers and experiment with the best network architecture on your own.
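For readers unfamiliar with the term, a single Dense layer computes a weighted sum of its inputs plus a bias, followed by an activation function. The plain-Python sketch below illustrates that computation only. It is not the boilerplate model from the repository, and the weights, the layer sizes and the sigmoid rescaled to the 1.0 to 3.0 propensity range are all illustrative assumptions.

```python
import math


def dense_forward(x, weights, biases):
    """Forward pass of one dense layer; weights[j][i] links input i to output j."""
    outputs = []
    for w_row, b in zip(weights, biases):
        z = sum(w * xi for w, xi in zip(w_row, x)) + b
        sigmoid = 1.0 / (1.0 + math.exp(-z))
        # Rescale the sigmoid output from (0, 1) to the (1, 3) propensity range.
        outputs.append(1.0 + 2.0 * sigmoid)
    return outputs


# Toy input with four features and two output scores (hypothetical sizes).
x = [0, 1, 0, 0]
weights = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.5, 0.0, 0.0]]
biases = [0.0, -0.5]
scores = dense_forward(x, weights, biases)
print(scores)
```

In Keras the same idea is expressed with a `Dense` layer, and adding further layers amounts to feeding the outputs of one such computation into the next.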

Call to action

  1. Feel free to ask questions about dspp-keras at support@peptone.io
  2. If you have found our GitHub repo useful, give us a STAR ✭.
  3. If you have further questions about structural propensity, feel free to leave comments under this article.


dspp-keras is based on ongoing scientific research into protein stability and intrinsic disorder. Therefore we suggest:

  1. Download and read the original research paper at http://biorxiv.org/content/early/2017/06/01/144840
  2. Search, browse and download the complete dSPP database at https://peptone.io/dspp
  3. Please cite us as,
@article {Tamiola144840,
author = {Tamiola, Kamil and Heberling, Matthew Michael and Domanski, Jan},
title = {Structural Propensity Database Of Proteins},
year = {2017},
doi = {10.1101/144840},
publisher = {Cold Spring Harbor Labs Journals},
URL = {http://biorxiv.org/content/early/2017/06/01/144840},
eprint = {http://biorxiv.org/content/early/2017/06/01/144840.full.pdf},
journal = {bioRxiv}
}

If you have decided to keep on reading, I strongly recommend the recent articles by Matthew Heberling, which elegantly describe the issues with current protein biotechnology protocols.

About Peptone

Founded in 2016 (Amsterdam, The Netherlands), Peptone offers state of the art solutions for protein biotechnology via Machine Learning and AI. We transform big data from public and private repositories into powerful predictive models and intuitive tools for protein production, stability, disorder, engineering, and directed evolution experiments, providing our clients with transparent and complementary software that saves time and yields precise research answers.

