Medicine and the need for AI

Written by jeremyphoward | Published 2017/09/11

I originally wrote this piece as an internal R&D direction document for my colleagues at doc.ai. We’ve decided to share it publicly because we feel that it’s important that as many people as possible are thinking about these issues.

Medicine has throughout history been an artisanal vocation — that is, it has focused on the skill and experience of the individual doctor rather than on building a standardized process for diagnosing and treating patients. In recent years this has started to change, as initiatives like Evidence-Based Medicine and Precision Medicine have tried to inject additional rigor and data-driven practice into the field. However, the vast majority of medical care is still provided according to the traditional Hippocratic philosophy.

This needs to change. The largest population centers on the planet have less than 1/10th of the doctors they need, and at current rates it would take hundreds of years to fill the gap. Misdiagnoses, late diagnoses, and over-diagnoses kill millions of people and cost tens of billions of dollars. The technology is now being developed to fix this problem — to give medical workers and patients a clear summary of the exact information they need, when they need it. Such technology can give a community health worker in a remote area access to a distillation of the world’s medical knowledge. It can make doctors in the developed world dramatically more productive and accurate, while giving patients and families more control over, and insight into, their medical care.

AI, and specifically deep learning, has already shown that it can be a powerful diagnostic tool, achieving super-human performance in a number of medical imaging tasks.

The challenges

Labeled historical data

It is widely believed that deep learning algorithms require vast amounts of data to be effective. This is not necessarily true. For instance, Enlitic’s lung cancer algorithm had access to scans of just a thousand patients with cancer. It’s important to understand that although the dataset (from the National Lung Screening Trial) was relatively small, it had the key characteristics to allow effective modeling:

  • It contained annual scans over 3 years for each patient; seeing the development of a disease over time is critical to creating diagnostic algorithms
  • Radiologists had provided annotations showing roughly where nodules were located, allowing the algorithm to focus on the important information
  • The dataset included information about each patient’s medical outcomes after the 3 years of the trial — labels showing things such as patient survival are necessary for creating diagnostic systems.

It’s also useful to see what couldn’t be provided by this project: treatment recommendations. Because the dataset did not include longitudinal treatment data (which interventions each patient received and how they responded), the algorithm developed was useful only for diagnosis, not treatment planning.
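To make these ingredients concrete, here is a minimal sketch of what a modeling-ready record from such a dataset might look like. The field names are hypothetical illustrations, not the actual NLST data dictionary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical schema illustrating the three properties described above:
# longitudinal scans, radiologist annotations, and outcome labels.

@dataclass
class NoduleAnnotation:
    scan_year: int                         # which annual screening round this refers to
    approx_location: Tuple[int, int, int]  # rough voxel coordinates from the radiologist
    diameter_mm: float                     # estimated nodule size

@dataclass
class PatientRecord:
    patient_id: str
    ct_scan_paths: List[str]             # one CT volume per annual screening round
    annotations: List[NoduleAnnotation]  # weak labels: roughly where nodules are
    cancer_confirmed: Optional[bool]     # final diagnosis label
    survival_months: Optional[float]     # outcome after the trial period
    # Note what is missing: treatments given and responses to them --
    # which is why such a dataset supports diagnosis but not treatment planning.
```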

However, even this is extraordinarily powerful: currently the death rate for people diagnosed with lung cancer is nearly 90%, with nodules not found until they are, on average, 40 mm in size. The system developed by Enlitic reliably found nodules of 5 mm or smaller. When a nodule is found that early, the probability of survival is 10x higher!

Now think about how unusual this kind of dataset is. How often can we access a unified medical record containing all information about tests, diagnoses, and interventions for a patient over a multi-year period? The information is spread over multiple institutions, and within an institution over multiple departments.

Legal conservatism

Even when the data is available in a central location, or can be cobbled together from multiple sources, the institutions that hold it are frequently wary of sharing it with the data scientists who can build these powerful algorithms. Legal staff recognize that a single privacy failure could end their career and cost their institution millions, whereas the theoretical cost of the missed opportunity (from saying ‘no’ to a data request) is hard to pin down.

Yet when patients are asked whether they would be willing to share their medical data if it could help others in the future, most are very happy to grant permission — especially when that sharing could lead to better options for their own treatment.

The opportunity for patients

Patient-controlled data

This leads to a clear opportunity: let patients be in control of their own medical data, across all their visits to different institutions and departments, as well as the information they collect themselves (such as data from wearables and self-reported data). Give those patients the opportunity to opt in to sharing that data with particular data scientists for particular projects, give them a secure data environment, and in return give them:

  • Early access to the resulting medical breakthroughs
  • Financial remuneration
  • Information about the work that was completed thanks to their data, and how it’s helping other patients.

This may be the only way we’re going to see the true potential of deep learning in medicine — at least in the US (some centrally managed countries may be able to create the needed datasets through government decree).

There’s a closely related opportunity for families caring for loved ones with rare or untreatable diseases: get together with other patients in the same situation and agree to pool patient data. The more patients who can be brought into the pool, the greater the chance that the critical information will be available.

The Blockchain

Not all data is created equal. Data from people with rare diseases is critical to diagnosing and treating those diseases. Data collected over many years is more valuable than data covering a short period. At the other extreme, bad actors may even fake data in an attempt to fraudulently gain remuneration.

By using a blockchain we can create a clear, auditable record of medical data sources. Based on this record, data providers can be rewarded according to how useful their data proves to be in practice: the more complete, accurate, and relevant the data they provide, the greater the reward.
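As a rough illustration of the idea (not a description of any particular implementation), a hash-chained audit log can record which pseudonymous contributor supplied which data to which project, so the history cannot be silently rewritten and rewards can be traced back to contributions. All identifiers below are made up.

```python
import hashlib
import json
import time

# Minimal hash-chained audit log sketch: each entry records who contributed
# which data to which project, and is linked to the previous entry so the
# history cannot be silently rewritten. A production system would add
# signatures, consensus, and off-chain storage of the data itself.

def make_entry(prev_hash: str, contributor_id: str, data_hash: str, project_id: str) -> dict:
    entry = {
        "timestamp": time.time(),
        "contributor_id": contributor_id,  # pseudonymous patient or institution ID
        "data_hash": data_hash,            # hash of the data payload, not the data itself
        "project_id": project_id,
        "prev_hash": prev_hash,
    }
    entry["hash"] = hashlib.sha256(
        json.dumps(entry, sort_keys=True).encode()
    ).hexdigest()
    return entry

def verify_chain(chain: list) -> bool:
    """Recompute every hash and check each entry points at its predecessor."""
    for i, entry in enumerate(chain):
        body = {k: v for k, v in entry.items() if k != "hash"}
        expected = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if entry["hash"] != expected:
            return False
        if i > 0 and entry["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

# Example: two contributions to the same (hypothetical) project
chain = [make_entry("genesis", "patient-001", "sha256:abc...", "lung-screening")]
chain.append(make_entry(chain[-1]["hash"], "patient-002", "sha256:def...", "lung-screening"))
assert verify_chain(chain)
```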

It also provides very interesting opportunities for institutions, which (with their patients’ permission) can provide whole datasets to researchers, and in return receive both financial returns and access to the technology that results from the data. In the longer term, patients can authorize their institutions to pass on their data to data scientists through the same blockchain-based approach.

The opportunity for data scientists

Most data scientists say they’d like to do something meaningful with their skills, but few get the opportunity. Much of the work available to data scientists is in areas like ad-tech, hedge fund trading, and product recommendations. The main barriers to more meaningful work are lack of access to data, not knowing which problems need to be solved, and having no way for a solution to get noticed and implemented.

In order to turn data into useful results, data scientists need to be able to complete the following steps (which in practice are repeated multiple times in various orders):

  1. Data cleaning
  2. Exploratory data analysis
  3. Creating a validation set
  4. Building a model
  5. Analyzing and validating the model

To do these steps, data scientists need a rich analytical environment where they can use their choice of tools, libraries, and visualization solutions. Most data scientists doing this kind of work today use R (generally in RStudio) or Python (generally in Jupyter Notebook).
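As a sketch, the five steps above might look something like this in a Jupyter notebook, using pandas and scikit-learn. The file name and column names are hypothetical, and a random train/validation split is shown only for brevity (grouped or time-based splits are usually safer for medical data).

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# 1. Data cleaning (hypothetical de-identified extract, assumed numeric features)
df = pd.read_csv("deidentified_patients.csv")
df = df.dropna(subset=["outcome"])              # drop rows with no label
df["age"] = df["age"].clip(lower=0, upper=110)  # remove implausible values

# 2. Exploratory data analysis
print(df.describe())
print(df.groupby("outcome").mean(numeric_only=True))

# 3. Creating a validation set
features = df.drop(columns=["outcome", "patient_id"])
train_X, valid_X, train_y, valid_y = train_test_split(
    features, df["outcome"], test_size=0.2, random_state=42
)

# 4. Building a model
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(train_X, train_y)

# 5. Analyzing and validating the model
valid_preds = model.predict_proba(valid_X)[:, 1]
print("Validation AUC:", roc_auc_score(valid_y, valid_preds))
print(sorted(zip(model.feature_importances_, features.columns),
             key=lambda t: t[0], reverse=True)[:5])
```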

By providing such an environment with the data pre-installed and the problem to solve clearly defined, the data scientist can quickly get to work on a meaningful problem.

It may even be possible for multiple data scientists to independently work on the same problem, with rewards being shared based on the utility of their work.

What we need to provide

Data gathering

We need to give each patient the ability to gather and maintain their personal medical data, including:

  • Lab tests and imaging studies
  • Diagnoses
  • Medications prescribed
  • Non-prescription medications and supplements taken
  • Other medical interventions
  • Exercise and eating records
  • Family history (ideally, automatically maintained by linking across family members)
  • Self-reported progress, such as energy levels, happiness level, and so forth
  • Genomics and other tests

This means being able to download data from each patient’s medical providers, both as a one-time download at setup and on a regular basis after that, as well as importing data from individual health-tracking and wearable apps through their APIs.
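A sketch of the wearable-import side, assuming a hypothetical vendor API: the endpoint, token handling, and field names below are made up, and a real integration would use the provider’s actual API (for clinical records, typically a FHIR endpoint). The point is only the normalization into the patient’s unified record.

```python
import requests

# Hypothetical endpoint and payload shapes, for illustration only.
WEARABLE_API = "https://api.example-wearable.com/v1/daily-summary"

def fetch_wearable_days(token: str, start: str, end: str) -> list:
    resp = requests.get(
        WEARABLE_API,
        headers={"Authorization": f"Bearer {token}"},
        params={"start": start, "end": end},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["days"]

def normalize(day: dict) -> dict:
    """Map a vendor-specific record onto the patient's unified schema."""
    return {
        "date": day["date"],
        "steps": day.get("steps"),
        "resting_heart_rate": day.get("rhr"),
        "sleep_minutes": day.get("sleep", {}).get("total_minutes"),
        "source": "example-wearable",  # every record keeps its provenance
    }

# records = [normalize(d) for d in fetch_wearable_days(token, "2017-01-01", "2017-03-31")]
```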

Data sharing

Each patient needs to be able to opt in or out of each request for their data. If the system is successful, there could be many requests, and dealing with each one individually could become burdensome. In that case, we can give patients the ability to set rules on which requests to automatically accept or reject, and which require manual review.
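A minimal sketch of what such rules could look like, assuming hypothetical data-type and requester labels:

```python
from dataclasses import dataclass

@dataclass
class DataRequest:
    project_id: str
    requester: str
    data_types: frozenset  # e.g. {"labs", "imaging", "genomics"}

@dataclass
class ConsentRules:
    always_share: frozenset      # data types the patient shares with anyone
    never_auto_share: frozenset  # data types that always require manual review
    trusted_requesters: frozenset

    def decide(self, req: DataRequest) -> str:
        """Return 'accept', 'reject', or 'ask' (manual review)."""
        if req.data_types & self.never_auto_share:
            return "ask"
        if req.requester in self.trusted_requesters:
            return "accept"
        if req.data_types <= self.always_share:
            return "accept"
        return "ask"

rules = ConsentRules(
    always_share=frozenset({"steps", "self_reported"}),
    never_auto_share=frozenset({"genomics"}),
    trusted_requesters=frozenset({"rare-disease-consortium"}),
)
print(rules.decide(DataRequest("p1", "adtech-co", frozenset({"labs"}))))  # -> "ask"
```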

Each piece of data needs to be tagged with its source in an auditable way. The data need not be stored on the patient’s device; indeed, some types of medical data are too large for on-device storage.

Once a patient gives a project access to some of their data, that data needs to be made available to the researcher. Each data scientist will need to be provided with a rich analytical environment for their work, one that presents the problem they have been asked to solve and shows how to access the data for the project.

The big opportunity

Providing patients with the ability to have control over their medical data, and data scientists the ability to solve pressing medical problems, is a powerful idea. But it’s only the tip of the iceberg. The bigger opportunity is what happens when the models can be continually improved, and then all these models can be combined. Each data scientist’s feature engineering steps can be saved and made available to future researchers (and they would be compensated when their approaches are re-used), and their pre-trained model activations can be automatically brought into new models to see if they add predictive power.
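One way the re-use of feature engineering could work in practice, sketched with scikit-learn and joblib (the column names and file name are illustrative): a researcher’s transformation steps are packaged as a pipeline that later projects can load and re-apply unchanged, while the provenance record credits the original author.

```python
import joblib
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One researcher's feature-engineering steps, packaged for re-use.
feature_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

df = pd.DataFrame({"age": [54, 61, None], "bmi": [22.1, None, 30.4]})
feature_pipeline.fit(df)
joblib.dump(feature_pipeline, "lung_features_v1.joblib")

# A later project loads and re-applies the exact same transformation,
# and the audit record sketched earlier can credit the original author.
reused = joblib.load("lung_features_v1.joblib")
new_features = reused.transform(pd.DataFrame({"age": [47], "bmi": [27.8]}))
```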

Allowing new data to continuously improve existing models requires that the meaning and format of all data sources stay consistent. This is a complex topic, but one that experienced data-product project managers will have dealt with before. Changes to data source formats or semantics need to be identified up front, and continuous model testing is critical.
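A lightweight sketch of the kind of check that might run before a new batch of data reaches an existing model (the column names, dtypes, and ranges are illustrative assumptions):

```python
import pandas as pd

# Expected schema for an existing model's input; a mismatch should stop the
# pipeline before the model silently degrades.
EXPECTED_SCHEMA = {
    "age": "float64",
    "systolic_bp_mmHg": "float64",
    "hba1c_percent": "float64",
}

def check_schema(df: pd.DataFrame) -> list:
    """Return a list of human-readable problems; empty means the batch looks safe."""
    problems = []
    for col, dtype in EXPECTED_SCHEMA.items():
        if col not in df.columns:
            problems.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            problems.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    # crude range check to catch unit changes (e.g. mmol/mol vs percent)
    if "hba1c_percent" in df.columns and df["hba1c_percent"].max() > 20:
        problems.append("hba1c_percent out of range -- possible unit change")
    return problems

batch = pd.DataFrame({"age": [54.0], "systolic_bp_mmHg": [128.0], "hba1c_percent": [48.0]})
print(check_schema(batch))  # flags the suspicious HbA1c value
```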

Through re-using pre-trained models, we get much of the benefit of combining data across all these datasets, with none of the logistical or privacy challenges of pooling the raw data.

This also means that rare diseases and pediatric diseases, where only a small amount of data is available, can be effectively tackled. In those situations, pre-trained models can be used to extract features from the data, and very simple models with few parameters can be used to combine them.
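A sketch of that pattern: activations from frozen pre-trained models act as per-patient features for a small cohort, and a heavily regularized logistic regression with few parameters is fitted on top. The activations below are random stand-ins purely to show the shapes involved, not real model outputs.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in for concatenated activations from frozen pre-trained models
# (e.g. an imaging model and a labs model), one row per patient.
n_patients, n_activations = 60, 32  # tiny rare-disease cohort
activations = rng.normal(size=(n_patients, n_activations))
labels = rng.integers(0, 2, size=n_patients)

# A simple, heavily regularized model with few parameters on top of the
# pre-trained features -- appropriate when data is scarce.
clf = LogisticRegression(C=0.1, max_iter=1000)
scores = cross_val_score(clf, activations, labels, cv=5, scoring="roc_auc")
print("Cross-validated AUC:", scores.mean())
```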

As we move forward, this approach to gathering and analyzing data will lead to new insights and will provide medical workers and patients a clear summary of the exact information they need, when they need it.

