Drug discovery is getting increasingly tough and expensive. Despite technological progress, the cost of developing a new drug doubles every nine years. That’s Eroom’s law of Pharma, the mirror image of Moore’s law for computer performance (Eroom is Moore spelled backwards). Nowadays, developing a new drug costs more than $2.8 billion. In this context, pharmaceutical companies need to overprice their few successful drugs to compensate for all the R&D failures in their drug portfolio. For example, Sovaldi, a new treatment for hepatitis C, costs as much as $84,000 in the United States. The Pharma industry is greedy because of this feeling of scarcity.

Figure: Drugs are getting more expensive

In the tech industry, the situation is different. Optimism prevails. Tech is fueled by Moore’s law, the observation that computer performance doubles roughly every 18 months.

Figure: Moore’s law

This exponential progress keeps prices low. For example, Google gives away the use of its new TPU chip for free for some scientific projects. Tech companies are more generous due to their feeling of abundance. How can Tech help Pharma, especially at a time of expansion for Artificial Intelligence?

Figure: Computing power is getting cheaper

How can AI help drug discovery?

A survey paper about deep learning for medicine has been published recently. In this post, I prefer to suggest some challenges and some ideas for future work. Some of them are definitely approachable by a student trained on online courses (Coursera, Stanford, …).

GAN to generate new drugs

I chose to pick one paper and see how to improve it. This paper proposes to generate new drugs against cancer by using Generative Adversarial Networks. I think it’s awesome.

Figure: GAN can generate pics of new animals, but also new molecules

The idea is to train an Adversarial Auto-encoder with known anti-cancer molecules, and then generate new anti-cancer molecules. To measure performance, the authors try to re-discover existing anti-cancer drugs.
I suggest improving on this GAN paper and its Adversarial Auto-encoder with better training data and better features for molecules, and generalizing the model to multi-drug and multi-task settings. The network architecture can also be improved; one of the authors already discusses that in his blog.

Figure: Architecture of the molecule generator network

Feed different datasets on the same model

Here is a warm-up exercise: instead of cancer data, try the AIDS Antiviral Screen Data. It can be adapted to the model with minor modifications. Then try other diseases and datasets (infectious diseases…). There are a lot of free datasets, but they are scattered around the web.

For cancer data, the paper used the NCI-60 Growth Inhibition Data. However, the NCI-60 was deprecated in early 2016. It is a panel of 60 human cancer cell lines grown in culture, which has little relevance for real cancers. The new standard is ‘Patient-Derived Xenografts’ (PDXs), fresh human tumor samples grown in mice.

Figure: Tumors grown in mice are the new standard

However, I did not find a large public dataset about the effect of anti-cancer drugs on PDXs. Also, a lot of PDX data remains hidden in Pharma companies.

Features: from fingerprints to mol2vec

Another way to improve the model is to use better features. The GAN paper uses hand-crafted features, the MACCS molecular fingerprints. They represent a molecule as a binary vector, in which each bit flags the presence of a predefined structural pattern. It’s a kind of one-hot encoding. Instead, it would be better to have a dense representation of molecules, a kind of mol2vec, analogous to word2vec in NLP: two molecule vectors would be close if the corresponding molecules are chemically similar.

In this direction, there are interesting papers about Molecular Graph Convolutions (here and here).
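
To make the contrast between binary fingerprints and dense vectors more concrete, here is a minimal sketch in plain Python. It compares two toy molecules with the Tanimoto similarity, a common choice for binary fingerprints, and with the cosine similarity, a common choice for dense embeddings. The bit vectors, the embeddings and the names `fp_x`, `emb_x`, etc. are made up for illustration; they do not come from the GAN paper or from any real molecule.

```python
from math import sqrt


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two equal-length binary fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    shared = sum(1 for a, b in zip(fp_a, fp_b) if a and b)  # bits set in both
    total = sum(1 for a, b in zip(fp_a, fp_b) if a or b)    # bits set in either
    # Two all-zero fingerprints are treated as identical by convention here.
    return shared / total if total else 1.0


def cosine(vec_a, vec_b):
    """Cosine similarity between two equal-length dense vectors."""
    if len(vec_a) != len(vec_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm = sqrt(sum(a * a for a in vec_a)) * sqrt(sum(b * b for b in vec_b))
    return dot / norm if norm else 0.0


# Hypothetical 8-bit fingerprints (real MACCS fingerprints have many more bits).
fp_x = [1, 1, 0, 0, 1, 0, 1, 0]
fp_y = [1, 1, 0, 0, 1, 0, 0, 0]  # shares most substructures with fp_x
fp_z = [0, 0, 1, 1, 0, 1, 0, 1]  # shares no substructure with fp_x

# Hypothetical 3-dimensional embeddings, standing in for a learned "mol2vec".
emb_x = [0.9, 0.1, 0.3]
emb_y = [0.8, 0.2, 0.4]
emb_z = [-0.7, 0.9, -0.2]

print("Tanimoto x/y:", tanimoto(fp_x, fp_y))
print("Tanimoto x/z:", tanimoto(fp_x, fp_z))
print("Cosine   x/y:", round(cosine(emb_x, emb_y), 3))
print("Cosine   x/z:", round(cosine(emb_x, emb_z), 3))
```

With fingerprints, similarity only counts shared hand-picked bits. With a learned dense representation, the hope is that distances would reflect chemical similarity even when no predefined pattern is shared.
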
The idea of Molecular Graph Convolutions is to start from molecular graphs, then perform convolutions on these graphs in a way that generalizes convolutions on matrices (a standard 2-D matrix is a square grid graph).

Molecular graphs seem the way to go for molecule representations. However, in practice, they still do not outperform molecular fingerprints. So more work is needed. Moreover, a follow-up paper uses Molecular Graph Convolutions again, but this time the authors do not even bother to compare them with fingerprints. It would be great to perform a benchmark. This might be doable by a beginner, because the authors also released a Python library built on top of TensorFlow to facilitate this kind of work: DeepChem.

Beyond molecular graphs

In order to outperform molecular fingerprints, maybe it is necessary to represent molecules with even more chemical realism. Here are two ideas.

First, molecules live in 3 dimensions, whereas molecular graphs are in 2D. So the 3D molecular structure could be taken into account.

Figure: Molecules live in 3 dimensions

Second, in molecular graphs, edges represent chemical bonds. Those bonds have a well-defined localization between 2 atoms. However, this is only an approximation of reality, because from the viewpoint of quantum mechanics, particles are nonlocal: a particle is spread over all its possible locations at once. This nonlocality can affect chemical properties, and therefore drug activity. For example, in an aromatic ring, electrons are delocalized between all the atoms of the ring. From a quantum viewpoint, it does not make sense to split this ring into edges.

Figure: Formation of an aromatic ring

The Molecular Graph paper uses one-hot encoding to represent whether an edge of the molecular graph is part of an aromatic ring (tables 2 and 3, page 7 in this paper). This leaves room for improvement.

Beyond Growth Inhibition: clinical trials and interactomes

In the GAN paper, the impact of a drug is measured by Growth Inhibition. This measure of success is very rough.
In practice, there are a lot more parameters. For example, side effects should be taken into account. There is a database here. It’s also important to take into account the expected recurrence of the disease. Ultimately, it would be interesting to input the whole results of clinical trials into the model.

An even better thing would be to anticipate the various effects of a drug using networks of molecular interactions: the interactomes. There is already some work around this topic.

Figure: Network of interactions

Combination therapy

Modern treatments often involve multiple drugs, to minimize drug resistance. As a result, the GAN model should take multiple molecules as input. However, I did not find a dataset about that.

Figure: Modern treatments involve combinations of drugs

Multi-task learning

Finally, the GAN model of drug discovery should be able to discover drugs for multiple diseases at the same time. This multi-task learning improves performance.

Figure: Multi-task neural network

Conclusion: a lot of challenges ahead!

In conclusion, drug discovery is a field full of exciting and impactful challenges for the Pharma & AI crowds, students, scientists and sponsors. All are welcome to communicate here on the blog, or on Startcrowd, a social network built around collaborative AI projects.