Using Data and Social Graphs for Clinical Trials by @danielcrowe

Daniel Crowe

Chicago > NYC > London @ Vaticle

Building a social graph — knowledge graph — to improve clinical trials' processes and reduce costs by providing better clarity and access to heterogeneous datasets.

Back in February, when we could all still gather safely, Grakn Cosmos, Grakn Labs’ first global user conference, hit London. There, Paul Agapow, Health Informatics Director at AstraZeneca, spoke about his team’s work in building a social graph to reduce the time and money spent on recruiting for clinical trials.

"…this is a first step in it, for us to develop expertise to explore, to see where we can go — we are people with problems to solve."

The Valley of Death

Don’t Worry There’s a Happy Ending

Drug development is an incredibly difficult, protracted, and expensive process. This isn’t news: the process is often likened to walking into a casino, dropping £100k on the table, and betting it all on black 27.

This is because a new drug that has passed all phases of clinical trials and reached manufacturing is the result of roughly ten years of work and 2 billion dollars.

Many pharmaceutical companies have internally stored drug-protein interaction sets: tens of thousands of candidate molecules that are able to interact with some part of the human body. This is the starting point for many drug development projects. These candidates are then whittled down based on which are feasible to deliver and which can be manufactured reasonably. Before a drug is ever put into a human body, the list may be down to around 250.

From these 250, which get put into phase 1 of clinical trials — where they check for safety and efficacy — you may get down to 5 before regulatory review. At the end of all this, you come out with 1 new drug.

From 10,000 possible candidates, down to 1. In between, we have 2 billion dollars and roughly ten years. This is what we call the ‘Valley of Death’.
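
To make the attrition concrete, here is a minimal sketch of the funnel arithmetic, using only the stage counts quoted above (10,000, 250, 5 and 1). The stage labels and the function name are illustrative, not terms from the talk.

```python
# Candidate counts at each stage, as quoted in the article.
# The stage labels are illustrative wording, not official terminology.
funnel = [
    ("candidate molecules", 10_000),
    ("entering clinical trials", 250),
    ("reaching regulatory review", 5),
    ("approved drug", 1),
]


def survival_rates(stages):
    """Return (from_stage, to_stage, fraction_surviving) for each step."""
    rates = []
    for (name_a, n_a), (name_b, n_b) in zip(stages, stages[1:]):
        rates.append((name_a, name_b, n_b / n_a))
    return rates


rates = survival_rates(funnel)
for src, dst, rate in rates:
    print(f"{src} -> {dst}: {rate:.2%} survive")

# Fraction of the original candidates that become a drug.
overall = funnel[-1][1] / funnel[0][1]
print(f"overall: {overall:.2%}")
```

Even on these rough figures, no single step lets more than one in five candidates through, which is why the cost of the failures dominates.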

When you pay for a drug, you aren’t paying for that drug’s development as much as you are paying for all those failures. Companies like AstraZeneca are looking at the way they recruit for clinical trials as a possible area for optimisation.

But First, an Abridged Overview of the Clinical Trial Process

Phase 1: Looks at the safety of a drug when administered to a small number of patients. Often tens of people are included in a trial at Phase 1.

Phase 2: Reassesses safety (is it safe for humans?) and looks at the efficacy of the intended treatment. During Phase 2, the drug is tested on hundreds of patients.

Phase 3: A final safety and efficacy check (will this drug accidentally kill you, and does it do any good?), which also asks whether the drug does its job better than anything currently in production.

Clinical trials benefit from having a large pool of subjects. Generally speaking, the more subjects, the more statistical power. However, as each subject costs between $1–10k and 60% of clinical trials fall short of their recruitment targets, this is one of the more arduous aspects of the process.
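
The trade-off between statistical power and recruitment cost can be sketched in a few lines. This is an illustration under stated assumptions, not a trial-design tool: it uses the textbook normal approximation for a two-sided, two-arm comparison, a hypothetical standardised effect size of 0.3, and the $1-10k per-subject range quoted above. The function names are made up for this example.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(n_per_arm, effect_size, alpha=0.05):
    """Approximate power of a two-sided, two-arm z-test.

    effect_size is a standardised mean difference (a hypothetical input).
    Uses the normal approximation and ignores the negligible opposite tail.
    """
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(effect_size * sqrt(n_per_arm / 2) - z_crit)


def recruitment_cost_range(n_subjects, low=1_000, high=10_000):
    """Cost bounds in dollars, using the $1-10k per subject quoted above."""
    return n_subjects * low, n_subjects * high


for n in (25, 50, 100, 200):
    cost_low, cost_high = recruitment_cost_range(2 * n)
    power = approx_power(n, effect_size=0.3)
    print(f"n/arm={n}: power~{power:.2f}, cost ${cost_low:,}-${cost_high:,}")
```

Power rises with every extra subject, but so does the bill, which is why a trial that falls short of its recruitment target is costly twice over.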

How Does Recruitment Work?

Traditionally, recruiting was done through social greasing: individual connections, brochures, recruitment sites, advertisements around hospitals, etc. Today, this approach cannot keep up with higher recruitment targets and the growing number of drugs being put into the pipeline.

With federated Electronic Health Records (EHRs), such as TriNetX, you get a large subset of patient data from various regions. The downside is that this is still just a subset of the total potential subjects, and it often lacks the detail needed for qualification. In the slide below, Paul provided some of the ways to disqualify a potential subject.


You’ll notice that many of these disqualifications could be mitigated by asking social questions: soft questions that pharmaceutical companies are notoriously bad at asking, and that may help to reduce the time and resources involved in recruiting a complete group of trial subjects.


Real-World Evidence

As organisations and teams reckon with this reality, there’s data out there that is under-utilised, cheap, and increasing in volume and scope. It is data that often cannot be acquired any other way and that represents a more realistic picture of individuals and/or groups of individuals. In some circles within the life sciences space, Paul’s included, this data is called “Real-World Evidence” or “Real-World Data”.

What we’re talking about is data from health trackers and wearables, electronic medical records and survey data, hospital and pharmacy data, social media, and consumer data. This is messy data that doesn’t come from a controlled clinical trial, i.e. anything that does not fit neatly into a table.

This data is increasingly being considered as a driver of signals that were previously not available or not identifiable. Aside from the fact that this “new” data is cheap, it enables pharmaceutical companies to cover previously inaccessible population segments. Additionally, the data covers questions that companies are not allowed to ask because of ethical considerations.

These “signals of the world” provide an expanded scope for clinical trials, where the population segments involved are predominantly not typical of the wider population. Paul notes that trial participants are often young, college-educated, white males.

Hence the need for “Real-World Evidence” — but how is it going to help reduce costs and time?

The Solution

Let’s start with what we have:

  • Newly accessible, “real-world evidence”, social data coming from heterogeneous sources — messy data
  • An opportunity to identify and discover new patient pools
  • An opportunity to qualify more quickly and accurately to avoid spending time and money on unqualified subjects

The solution that Paul and his team at AstraZeneca came up with was a social graph of trial practitioners. Paul called it a “Facebook for oncology doctors”. This graph would connect them through their collaborations, publications and workplaces. They could annotate the graph with these various sources of “Real World Evidence” and ask questions about these practitioners.

The project was called OPSIN (Oncology Practitioner Social Interaction Network) and was part of a larger effort to “streamline and galvanise clinical trials”.

Interestingly enough, in 2008, a team put together a similar concept, using “hand-rolled” graph databases, scraping the internet for any data that would allow them to find the key opinion leaders.

Today, there is an abundance of information and technical platforms for tackling this problem.

There is growing interest in graphs within pharma, particularly in knowledge graphs, which so far have focused on biochemical and biological entities.

Additionally, they can combine the external information now available to them with all of their internal data and, maybe more importantly, the knowledge that comes with years of trials and tribulations.

Building a Social Graph

Looking at their predecessors, the team saw a few problems that ran through those early efforts. The key one was sustainability: these early graphs were not built for more than a one-off project. Paul notes that they were often built by someone using R or hacking together a Python script, hand-building their own graph database.

None of these were suitable operationally. A graph would need to be plugged into other systems so that teams across the organisation could make use of it. This is where Grakn steps in.

Grakn is a database for complex data, from the team at Grakn Labs. The team at AstraZeneca knew they were going to be interested in edges, the provenance of edges, and making statements about edges, and came to realise that what they were really talking about was hypergraphs. Grakn, being a hypergraph database, makes that much easier.
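
To illustrate what "making statements about edges" means, here is a small plain-Python sketch. It is not Grakn’s data model or API; the class names, role names and sample data are all hypothetical. The idea it shows is that a relation is a first-class object that can connect more than two things, and that another relation (for example, provenance) can point at it.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    kind: str
    name: str


@dataclass
class Relation:
    kind: str
    # Role name -> role player. A player may be an Entity or another
    # Relation, which is what lets us say things about an edge.
    roles: dict
    attributes: dict = field(default_factory=dict)


# Hypothetical sample data.
dr_a = Entity("person", "Dr. A")
dr_b = Entity("person", "Dr. B")
dr_c = Entity("person", "Dr. C")
index = Entity("data-source", "publication index")

# One edge connecting three people: more than a pairwise link.
paper = Relation(
    "publication",
    {"author-1": dr_a, "author-2": dr_b, "author-3": dr_c},
    {"year": 2019},
)

# A statement about the edge itself: where did this fact come from?
provenance = Relation("provenance", {"fact": paper, "source": index})


def statements_about(target, relations):
    """Return the relations in which `target` plays some role."""
    return [r for r in relations if any(p is target for p in r.roles.values())]


all_relations = [paper, provenance]
print([r.kind for r in statements_about(paper, all_relations)])
```

In an ordinary property graph the provenance record would need a workaround, such as an extra node standing in for the edge; here the edge can simply be a role player.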

Basics of the Model

Bioinformaticians love formal schemas and are often chomping at the bit to dump their giant internal ontology in, without thinking about whether that is the right model for the situation. Paul was encouraged by some of the Grakn Labs team not to start with their existing ontology, but rather with a series of questions that they would want to ask. Drawing on that line of thinking, we’ll go through a few of those questions below.

However, as they do need a schema, here are some of the decisions that they made:

  • Keep things simple to start with; don’t model the entire world right away
  • Use the questions you want to ask as a way of drawing out potential concepts for your model — often when we articulate the questions out loud or on paper, we are able to see things more clearly
  • Start with people and institutions, and connect them through relations such as collaboration and publication
  • Give each of these relations a temporal element as an attribute
  • Keep scope to 5 years — people and connections change, 5 years feels relevant enough for our purposes
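
The decisions above can be sketched as a tiny data model. This is plain Python rather than the team’s actual Grakn schema, which the talk summary does not show. The class names, the "employment" link kind (standing in for the workplaces mentioned earlier) and the sample data are assumptions made for illustration.

```python
from dataclasses import dataclass

SCOPE_YEARS = 5  # "keep scope to 5 years"


@dataclass(frozen=True)
class Person:
    name: str


@dataclass(frozen=True)
class Institution:
    name: str


@dataclass(frozen=True)
class Link:
    kind: str        # "collaboration", "publication" or "employment"
    members: tuple   # the people and/or institutions taking part
    year: int        # the temporal attribute carried by every relation


def in_scope(link, reference_year):
    """True if the link is at most SCOPE_YEARS old in reference_year."""
    return 0 <= reference_year - link.year <= SCOPE_YEARS


# Hypothetical sample data.
dr_a = Person("Dr. A")
dr_b = Person("Dr. B")
site_y = Institution("Site Y")

links = [
    Link("collaboration", (dr_a, dr_b), 2019),
    Link("publication", (dr_a, dr_b), 2012),
    Link("employment", (dr_a, site_y), 2017),
]

recent = [link for link in links if in_scope(link, reference_year=2020)]
print([link.kind for link in recent])
```

Putting the year on the relation rather than on the person is what makes the five-year window a simple filter instead of a remodelling exercise.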

Questions We Can Ask

We’re already using X at site Y. Who else can we consider? Who do they know?

Friend-of-a-friend questions like these are trivial with a hypergraph. By tracing from one practitioner through a collaboration to another, you’re able to answer them and rank the results by the number of connections, how recent they are, or some other temporal filter on the connection.
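
As a rough sketch of the friend-of-a-friend question, the following plain-Python example walks two hops out from a practitioner and ranks the candidates by number of connecting paths, then by recency. It is not a Grakn query; the edge list, names and the `since` filter are hypothetical.

```python
from collections import defaultdict

# Hypothetical collaborations: (practitioner, practitioner, year).
collaborations = [
    ("X", "A", 2019),
    ("X", "B", 2017),
    ("A", "C", 2019),
    ("B", "C", 2018),
    ("A", "D", 2016),
]


def neighbours(edges):
    """Map each person to {collaborator: most recent year together}."""
    adj = defaultdict(dict)
    for a, b, year in edges:
        adj[a][b] = max(year, adj[a].get(b, year))
        adj[b][a] = max(year, adj[b].get(a, year))
    return adj


def friends_of_friends(person, edges, since=None):
    """Rank people two hops from `person` by path count, then recency."""
    if since is not None:
        edges = [e for e in edges if e[2] >= since]
    adj = neighbours(edges)
    direct = set(adj[person])
    scores = defaultdict(lambda: [0, 0])  # candidate -> [paths, latest year]
    for friend in direct:
        for candidate, year in adj[friend].items():
            if candidate == person or candidate in direct:
                continue  # skip the person themselves and existing contacts
            scores[candidate][0] += 1
            scores[candidate][1] = max(scores[candidate][1], year)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1][0], -kv[1][1], kv[0]))
    return [(name, paths, latest) for name, (paths, latest) in ranked]


print(friends_of_friends("X", collaborations))
print(friends_of_friends("X", collaborations, since=2017))
```

Here C outranks D because two of X’s existing collaborators have worked with C, and more recently.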

Is there a practitioner nearby? Are there patients in the same city that we can refer to the trial site?

While it may be difficult to express the notion of “nearby” in the graph itself, there are ways around this at the application layer, which, as of this writing, the team hasn’t needed to use yet. Grakn’s reasoning engine allows transitive relations to be inferred at query time.
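
The idea of inferring transitive relations can be illustrated with a small fixed-point computation. This is a sketch of the concept only, not how Grakn’s reasoning engine is implemented or invoked: the closure is computed eagerly here, whereas the article describes inference at query time. The place names and function names are hypothetical.

```python
# Hypothetical "located in" facts: (thing, place that contains it).
located_in = {
    ("Clinic 1", "City P"),
    ("Hospital 2", "City P"),
    ("City P", "Region Q"),
    ("Region Q", "Country R"),
}


def transitive_closure(pairs):
    """Apply the rule (a in b) and (b in c) => (a in c) until nothing new."""
    closure = set(pairs)
    while True:
        inferred = {(a, d) for a, b in closure for c, d in closure if b == c}
        if inferred <= closure:
            return closure
        closure |= inferred


def shared_places(x, y, closure):
    """Places that contain both x and y, directly or by inference."""
    places_x = {place for thing, place in closure if thing == x}
    places_y = {place for thing, place in closure if thing == y}
    return places_x & places_y


closure = transitive_closure(located_in)
print(("Clinic 1", "Country R") in closure)
print(sorted(shared_places("Clinic 1", "Hospital 2", closure)))
```

A containment hierarchy like this is one coarse stand-in for "nearby": two sites in the same city share a place without any distance calculation.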

What communities of practitioners are there? Who are the key figures in these communities?

Finding clusters (and clusters of clusters), identifying central concepts in the graph, and finding the connections between them are also trivial.
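
A deliberately crude sketch of the community question: the example below treats each connected component as a "community" and picks the member with the most connections as its key figure. The talk summary does not say which clustering or centrality measures the team used, so both choices are stand-ins, and the edge list is hypothetical.

```python
from collections import defaultdict, deque

# Hypothetical collaboration edges between practitioners.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("E", "F")]


def build_adjacency(pairs):
    adj = defaultdict(set)
    for a, b in pairs:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def communities(adj):
    """Connected components, found with a breadth-first search."""
    seen, groups = set(), []
    for start in sorted(adj):
        if start in seen:
            continue
        group, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            group.add(node)
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        groups.append(group)
    return groups


def key_figure(group, adj):
    """Member with the most direct connections (ties broken by name)."""
    return max(sorted(group), key=lambda n: len(adj[n]))


adj = build_adjacency(edges)
for group in communities(adj):
    print(sorted(group), "key figure:", key_figure(group, adj))
```

On a real network, connected components would be far too coarse and a proper community-detection method would take their place, but the shape of the question stays the same.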


What did I learn by listening to Paul’s story? Well, even as the pharmaceutical world is becoming more and more data-driven, there are still many soft questions to ask that bring great value to the clinical trials process. There is an abundance of data out there that can be used to tackle these questions.

It is a human enterprise, we are going to have human questions…

Paul also took time to thank his team and I would be remiss to leave them out: Domingo Salazar, Maja Malkowska, Linghui Li, and Gabi Feldberg.

Special thank you to Paul and his team at AstraZeneca for sharing their story and words of wisdom. This is obviously an area primed for optimisation and these projects are helping to push the entire industry forward.

You can find the full presentation on the Grakn Labs YouTube channel here.


