Drug discovery is getting increasingly tough and expensive. Despite technological progress, the cost of developing a new drug doubles every nine years. That’s Eroom’s law of Pharma, which mirrors Moore’s law for computer performance.
Nowadays, developing a new drug costs more than $2.8 billion. In this context, pharmaceutical companies need to overprice their few successful drugs to compensate for all the R&D failures in their portfolio. For example, Sovaldi, a new treatment for hepatitis C, costs as much as $84,000 in the United States. The Pharma industry is greedy because of this feeling of scarcity.
In the tech industry, the situation is different. Optimism prevails. Tech is fueled by Moore’s law, the observation that computer performance doubles roughly every 18 months.
This exponential progress keeps prices low. For example, Google gives away the use of its new TPU chip for free, for some scientific projects. Tech companies are more generous due to their feeling of abundance. How can Tech help Pharma, especially at a time of expansion for Artificial Intelligence?
A survey paper about deep learning for medicine was published recently. In this post, I would rather suggest some challenges and ideas for future work. Some of them are definitely approachable by a student trained on online courses (Coursera, Stanford…).
The idea is to train an adversarial autoencoder on known anti-cancer molecules, and then use it to generate new anti-cancer molecules. To measure performance, the authors check whether the model re-discovers existing anti-cancer drugs.
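The re-discovery check can be sketched in a few lines. This is only an illustration under my own assumptions, not the paper's pipeline: molecules are assumed to be encoded as binary vectors (stored here as sets of switched-on bit indices), matching is done with Tanimoto similarity, and the function names, the toy data and the 0.8 threshold are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def rediscovery_rate(generated, held_out, threshold=0.8):
    """Fraction of held-out known drugs matched by at least one generated candidate."""
    hits = sum(
        any(tanimoto(g, known) >= threshold for g in generated)
        for known in held_out
    )
    return hits / len(held_out)


# Toy data (invented): each molecule is the set of bits that are on in its vector.
held_out = [{1, 4, 7, 9}, {2, 3, 5}]    # known drugs hidden from training
generated = [{1, 4, 7, 9}, {0, 6, 8}]   # candidates produced by the model

rate = rediscovery_rate(generated, held_out)
print(rate)  # 0.5: one of the two hidden drugs was re-discovered
```

The important design point is that the held-out drugs must never be seen during training, otherwise the score only measures memorization.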
I suggest improving this GAN paper with better training data and better features for molecules, and generalizing the model to multi-drug and multi-task settings.
The network architecture can also be improved; one of the authors already discusses this on his blog.
Here is a warm-up exercise: instead of cancer data, try the AIDS Antiviral Screen Data. It can be adapted to the model with minor modifications. Then try other diseases and datasets (infectious diseases…). There are a lot of free datasets, but they are scattered around the web.
For cancer data, the paper used the NCI-60 Growth Inhibition Data. However, the NCI-60 was deprecated in early 2016. It is a panel of 60 human cancer cell lines grown in culture, which has little relevance to real cancers. The new standard is ‘Patient-Derived Xenografts’ (PDXs), fresh human tumor samples grown in mice.
However, I did not find a large public dataset about the effect of anti-cancer drugs on PDXs. Also, a lot of PDX data remains hidden inside Pharma companies.
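Another way to improve the model is to use better features. The GAN paper uses hand-crafted features, the MACCS molecular fingerprints. They represent a molecule as a binary vector, in which each bit flags the presence of a predefined substructure. It’s loosely a kind of one-hot encoding, except that several bits can be set at once.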
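To make the binary-vector idea concrete, here is a toy sketch of a structural-key fingerprint. The five keys below are hypothetical and chosen for illustration; the real MACCS set has 166 predefined keys, and the substructures are listed by hand here, whereas a cheminformatics toolkit would detect them from the molecule.

```python
# Hypothetical structural keys (not the actual MACCS definitions).
TOY_KEYS = ["aromatic ring", "carboxylic acid", "hydroxyl", "amine", "halogen"]


def fingerprint(substructures):
    """Return a binary vector with one bit per key: 1 if the substructure is present."""
    present = set(substructures)
    return [1 if key in present else 0 for key in TOY_KEYS]


# Substructures are written out by hand for two small molecules.
salicylic_acid = fingerprint(["aromatic ring", "carboxylic acid", "hydroxyl"])
phenol = fingerprint(["aromatic ring", "hydroxyl"])

print(salicylic_acid)  # [1, 1, 1, 0, 0]
print(phenol)          # [1, 0, 1, 0, 0]
```

Note that several bits are on at the same time, and that the vector says nothing about how the substructures are connected to each other.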
Instead, it would be better to have a dense representation of molecules, a kind of mol2vec, analogous to word2vec in NLP: two molecule vectors would be close if the corresponding molecules are chemically similar.
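What "close" means for dense vectors can be illustrated with cosine similarity. The three-dimensional vectors below are made up for the example and belong to no real molecule or trained model; a learned embedding would have many more dimensions.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1.0 for the same direction, negative for opposed vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


# Invented embeddings for three placeholder molecules.
embeddings = {
    "mol_a": [0.9, 0.1, 0.3],
    "mol_b": [0.8, 0.2, 0.4],
    "mol_c": [-0.5, 0.9, -0.2],
}


def nearest(name):
    """Name of the other molecule whose vector is most similar to the given one."""
    others = (k for k in embeddings if k != name)
    return max(others, key=lambda k: cosine(embeddings[name], embeddings[k]))


print(nearest("mol_a"))  # mol_b
```

If the embedding is good, a nearest-neighbor query like this one should return chemically similar molecules.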
Then perform convolutions on the molecular graph (atoms as nodes, bonds as edges), in a way that generalizes convolutions on matrices (a standard 2-D matrix is a square grid graph).
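The link between the two kinds of convolution can be checked on a toy example. The sketch below is my own simplification, not the architecture of any paper: the graph step uses fixed unit weights (a real graph convolution learns its weights), and all function names are hypothetical. It shows that summing over neighbors on a grid graph gives the same result as a plus-shaped 2-D convolution, and that the same function also runs on an irregular molecular graph.

```python
def aggregate(values, neighbors):
    """One graph-convolution step with fixed unit weights:
    each node's new value is its own value plus the sum of its neighbors' values."""
    return {n: values[n] + sum(values[m] for m in neighbors[n]) for n in values}


def grid_graph(rows, cols):
    """Adjacency lists of a rows x cols grid: each cell links to its 4 neighbors."""
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cand = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            nbrs[(r, c)] = [(i, j) for i, j in cand if 0 <= i < rows and 0 <= j < cols]
    return nbrs


def cross_conv(image):
    """Zero-padded 2-D convolution with the kernel [[0,1,0],[1,1,1],[0,1,0]]."""
    rows, cols = len(image), len(image[0])
    kernel = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    i, j = r + dr, c + dc
                    if 0 <= i < rows and 0 <= j < cols:
                        out[r][c] += kernel[dr + 1][dc + 1] * image[i][j]
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
values = {(r, c): image[r][c] for r in range(3) for c in range(3)}
on_graph = aggregate(values, grid_graph(3, 3))
on_matrix = cross_conv(image)
# Same numbers either way: the matrix is just a grid graph.
assert all(on_graph[(r, c)] == on_matrix[r][c] for r in range(3) for c in range(3))

# The same step on an irregular graph: the heavy atoms of ethanol (C-C-O).
ethanol = {"C1": ["C2"], "C2": ["C1", "O"], "O": ["C2"]}
atom_values = aggregate({"C1": 1.0, "C2": 1.0, "O": 2.0}, ethanol)
print(atom_values)  # {'C1': 2.0, 'C2': 4.0, 'O': 3.0}
```

The matrix version needs a fixed neighborhood shape, while the graph version only needs a list of neighbors per node, which is why it carries over to molecules.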
Molecular graphs seem the way to go for molecule representations. However, in practice, they still do not outperform molecular fingerprints. So more work is needed.
Moreover, a follow-up paper uses Molecular Graph Convolutions again, but this time the authors do not even bother to compare them with fingerprints. It would be great to run a benchmark. This might be doable by a beginner, because the authors also released DeepChem, a Python library built on top of TensorFlow, to facilitate this kind of work.
In order to outperform molecular fingerprints, maybe it is necessary to represent molecules with even more chemical realism. Here are two ideas:
This nonlocality can affect chemical properties, and therefore drug activity. For example, in an aromatic ring, electrons are delocalized between all the atoms of the ring. From a quantum viewpoint, it does not make sense to split this ring into edges.
The Molecular Graph paper uses one-hot encoding to represent whether an edge of the molecular graph is part of an aromatic ring (Tables 2 and 3, page 7 of this paper). The flag is attached to each edge separately and says nothing about the ring as a whole, which leaves room for improvement.
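Here is a rough sketch of what such an edge encoding looks like on benzene. The list of bond categories is my assumption for illustration and may not match the paper's feature tables exactly; the names are hypothetical.

```python
# Assumed bond categories, for illustration only.
BOND_TYPES = ["single", "double", "triple", "aromatic"]


def one_hot_bond(bond_type):
    """One-hot vector over BOND_TYPES for a single edge of the molecular graph."""
    return [1 if bond_type == t else 0 for t in BOND_TYPES]


# Benzene: six ring carbons numbered 0..5, each ring bond tagged as aromatic.
benzene_bonds = [(i, (i + 1) % 6, "aromatic") for i in range(6)]
edge_features = {(a, b): one_hot_bond(t) for a, b, t in benzene_bonds}

print(edge_features[(0, 1)])  # [0, 0, 0, 1]
print(len(edge_features))     # 6
```

All six edges carry the same local flag, but nothing in the encoding states that they share one delocalized electron system.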
In the GAN paper, the impact of a drug is measured by Growth Inhibition. This measure of success is very rough. In practice, many more parameters matter. For example, side effects should be taken into account. There is a database here. It’s also important to take into account the expected recurrence of the disease. Ultimately, it would be interesting to feed the full results of clinical trials into the model.
Modern treatments often involve multiple drugs, to minimize drug resistance. As a result, the GAN model should take multiple molecules as input. However, I did not find a dataset of such drug combinations.
Finally, the GAN model of drug discovery should be able to discover drugs for multiple diseases at the same time. This kind of multi-task learning can improve performance.
In conclusion, drug discovery is a field full of exciting and impactful challenges for the Pharma & AI crowds, students, scientists and sponsors. All are welcome to communicate here on the blog, or on Startcrowd, a social network built around collaborative AI projects.