Drug discovery keeps getting harder and more expensive. Despite technological progress, the cost of developing a new drug doubles every nine years. This is Eroom’s law of Pharma, the mirror image of Moore’s law for computer performance (“Eroom” is “Moore” spelled backwards).
Nowadays, developing a new drug costs more than $2.8 billion. In this context, pharmaceutical companies need to overprice their few successful drugs to compensate for all the R&D failures in their portfolio. For example, Sovaldi, a new treatment for hepatitis C, costs as much as $84,000 in the United States. This feeling of scarcity makes the Pharma industry greedy.
In the tech industry, the situation is different. Optimism prevails. Tech is fueled by Moore’s law, the observation that computer performance doubles roughly every 18 months.
This exponential progress keeps prices low. For example, Google offers free use of its new TPU chips to some scientific projects. Tech companies are more generous because of their feeling of abundance. How can Tech help Pharma, especially at a time of expansion for Artificial Intelligence?
How can AI help drug discovery?
A survey paper about deep learning for medicine was published recently. In this post, I would rather suggest some challenges and some ideas for future work. Some of them are definitely approachable by a student trained through online courses (Coursera, Stanford…).
GAN to generate new drugs
The idea of the GAN paper is to train an adversarial autoencoder on known anti-cancer molecules, and then generate new anti-cancer molecules. To measure performance, the authors try to rediscover existing anti-cancer drugs.
I suggest improving this GAN paper with better training data and better features for molecules, and generalizing the model to multi-drug and multi-task settings.
The network architecture can also be improved; one of the authors already discusses this on his blog.
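The rediscovery idea can be turned into a simple score. The sketch below is only an illustration, not the paper's actual evaluation protocol: it assumes that some known drugs are held out of the training set and that a generated molecule counts as a rediscovery when its identifier matches a held-out one exactly. The function name and the identifiers are hypothetical.

```python
def rediscovery_rate(generated, held_out):
    """Fraction of held-out known drugs that appear among generated molecules."""
    held_out = set(held_out)
    if not held_out:
        raise ValueError("held_out must contain at least one molecule")
    return len(held_out & set(generated)) / len(held_out)


# Hypothetical identifiers standing in for canonical molecule strings.
held_out_drugs = ["drug_A", "drug_B", "drug_C", "drug_D"]
generated_molecules = ["mol_1", "drug_B", "mol_2", "drug_D", "mol_3"]

rate = rediscovery_rate(generated_molecules, held_out_drugs)
print(f"Rediscovered {rate:.0%} of the held-out drugs")
```

In practice, exact matching is probably too strict, and a similarity threshold between molecules would be a natural relaxation.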
Feed different datasets into the same model
Here is a warm-up exercise: instead of cancer data, try the AIDS Antiviral Screen Data. It can be adapted to the model with minor modifications. Then try other diseases and datasets (infectious diseases…). There are a lot of free datasets, but they are scattered around the web.
For cancer data, the paper used the NCI-60 Growth Inhibition Data. However, the NCI-60 was retired in early 2016. It is a panel of 60 human cancer cell lines grown in culture, which has little relevance for real cancers. The new standard is ‘Patient-Derived Xenografts’ (PDXs), fresh human tumor samples grown in mice.
However, I did not find a large public dataset about the effect of anti-cancer drugs on PDXs. Also, a lot of PDX data remains hidden inside Pharma companies.
Features: from fingerprints to mol2vec
Another way to improve the model is to use better features. The GAN paper uses hand-crafted features, the MACCS molecular fingerprints, which represent a molecule as a binary vector. Each bit flags the presence of a predefined substructure, so it is a kind of multi-hot encoding rather than a learned representation.
Instead, it would be better to have a dense representation of molecules, a kind of mol2vec, analogous to word2vec in NLP: two molecule vectors would be close to each other if the corresponding molecules are chemically similar.
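To make the binary-vector idea concrete, here is a toy sketch. It does not use the real MACCS key definitions: the five keys below are made up for illustration, and they are tested by naive string matching on SMILES strings rather than by proper substructure search. Each bit flags one predefined pattern, and several bits can be set at once.

```python
# Toy substructure keys: NOT the real MACCS definitions. Each key is a
# (description, test) pair, and the test is naive string matching on SMILES.
TOY_KEYS = [
    ("contains nitrogen", lambda s: "N" in s or "n" in s),
    ("contains oxygen", lambda s: "O" in s or "o" in s),
    ("contains aromatic carbon", lambda s: "c" in s),
    ("contains double bond", lambda s: "=" in s),
    ("contains chlorine", lambda s: "Cl" in s),
]


def fingerprint(smiles):
    """Return a binary vector with one bit per toy key."""
    return [1 if test(smiles) else 0 for _, test in TOY_KEYS]


molecules = {
    "ethanol": "CCO",
    "acetic acid": "CC(=O)O",
    "benzene": "c1ccccc1",
    "pyridine": "c1ccncc1",
}

for name, smiles in molecules.items():
    print(f"{name:12s} {fingerprint(smiles)}")
```

The vector is fixed by hand-written rules, so two molecules that differ only in a pattern nobody thought to encode get identical fingerprints.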
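"Near" can be made precise with a similarity measure on the vectors. The sketch below uses cosine similarity, which is a common choice for word2vec-style embeddings but an assumption here. The 4-dimensional vectors are invented purely for illustration and do not come from any trained model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up embeddings, for illustration only (not produced by a real model).
embeddings = {
    "ethanol": [0.9, 0.1, 0.0, 0.2],
    "methanol": [0.8, 0.2, 0.1, 0.2],
    "benzene": [0.0, 0.9, 0.7, 0.1],
}

sim_alcohols = cosine_similarity(embeddings["ethanol"], embeddings["methanol"])
sim_mixed = cosine_similarity(embeddings["ethanol"], embeddings["benzene"])
print(f"ethanol vs methanol: {sim_alcohols:.2f}")
print(f"ethanol vs benzene:  {sim_mixed:.2f}")
```

A good mol2vec would be one where such similarities track chemical similarity without anyone hand-coding the features.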
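One way to get there is to treat a molecule as a graph, with atoms as nodes and chemical bonds as edges, and then perform convolutions on this graph in a way that generalizes convolutions on matrices (a standard 2-D matrix is a square grid graph).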
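A single convolution step on a molecular graph can be sketched as follows. This is a deliberately simplified version, not the layer used in any particular paper: it assumes sum aggregation over neighbors, two scalar weights instead of learned weight matrices, and a ReLU activation. The function name and the two-dimensional atom features are hypothetical.

```python
def graph_conv(features, adjacency, w_self, w_neigh):
    """One simplified graph-convolution step.

    features:  dict mapping node -> list of floats
    adjacency: dict mapping node -> list of neighbor nodes
    New vector of a node = ReLU(w_self * own + w_neigh * sum of neighbors).
    """
    updated = {}
    for node, own in features.items():
        neigh_sum = [0.0] * len(own)
        for nb in adjacency[node]:
            for i, x in enumerate(features[nb]):
                neigh_sum[i] += x
        updated[node] = [
            max(0.0, w_self * o + w_neigh * n) for o, n in zip(own, neigh_sum)
        ]
    return updated


# Heavy atoms of ethanol as a graph: C0 - C1 - O2.
adjacency = {0: [1], 1: [0, 2], 2: [1]}
# Toy atom features: [is_carbon, is_oxygen].
features = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}

out = graph_conv(features, adjacency, w_self=1.0, w_neigh=0.5)
for node, vec in out.items():
    print(node, vec)
```

After one step, the middle carbon already "knows" it sits next to an oxygen. On a square grid graph, where each pixel has up to four neighbors, the same update amounts to a convolution with a cross-shaped kernel whose four outer weights are tied.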
Molecular graphs seem the way to go for molecule representations. However, in practice, they still do not outperform molecular fingerprints. So more work is needed.
Moreover, a follow-up paper uses Molecular Graph Convolutions again, but this time the authors do not even compare them with fingerprints. It would be great to run such a benchmark. This might be doable by a beginner, because the authors also released a Python library built on top of TensorFlow to facilitate this kind of work: DeepChem.
Beyond molecular graphs
In order to outperform molecular fingerprints, maybe it is necessary to represent molecules with even more chemical realism. Here are two ideas:
- Molecules live in 3 dimensions, whereas molecular graphs are in 2D. So the 3D molecular structure could be taken into account.
- In molecular graphs, edges represent chemical bonds. Those bonds are well localized between 2 atoms. However, this is only an approximation of reality, because from the viewpoint of quantum mechanics, particles are nonlocal: an electron is described by a wavefunction spread over all its possible locations, rather than sitting at a single point.
This nonlocality can affect chemical properties, and therefore drug activity. For example, in an aromatic ring, electrons are delocalized over all the atoms of the ring. From a quantum viewpoint, it does not make sense to split this ring into separate edges.
The Molecular Graph paper uses a one-hot encoding to represent whether an edge of the molecular graph is part of an aromatic ring (Tables 2 and 3, page 7 of the paper). This leaves room for improvement.
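The limitation is easy to see in a small sketch. The encoding below assumes four bond categories (single, double, triple, aromatic), which may not match the paper's feature tables exactly, and the helper name is hypothetical.

```python
# Assumed bond categories; the exact feature set of the paper may differ.
BOND_TYPES = ["single", "double", "triple", "aromatic"]


def one_hot_bond(bond_type):
    """One-hot vector marking the category of a single bond (edge)."""
    if bond_type not in BOND_TYPES:
        raise ValueError(f"unknown bond type: {bond_type}")
    return [1 if bond_type == t else 0 for t in BOND_TYPES]


# Benzene ring: atoms 0..5, each bonded to the next one around the ring.
benzene_edges = {(i, (i + 1) % 6): one_hot_bond("aromatic") for i in range(6)}

for edge, encoding in benzene_edges.items():
    print(edge, encoding)
```

Every edge carries the same local flag, and nothing in the encoding says that the six bonds share one delocalized electron system. Any ring-level information has to be recovered indirectly by the network.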
Beyond Growth Inhibition: clinical trials and interactomes
In the GAN paper, the impact of a drug is measured by Growth Inhibition. This measure of success is very rough. In practice, there are many more parameters. For example, side effects should be taken into account, and a database of side effects is available. It’s also important to take into account the expected recurrence of the disease. Ultimately, it would be interesting to feed the full results of clinical trials into the model.
Modern treatments often involve multiple drugs, to minimize drug resistance. As a result, the GAN model should accept multiple molecules as input. However, I did not find a dataset for that.
Finally, the GAN model of drug discovery should be able to discover drugs for multiple diseases at the same time. This kind of multi-task learning can improve performance.
Conclusion: a lot of challenges ahead!
In conclusion, drug discovery is a field full of exciting and impactful challenges for the Pharma & AI crowds: students, scientists and sponsors. All are welcome to join the discussion here on the blog, or on Startcrowd, a social network built around collaborative AI projects.