We humans have a natural tendency to find patterns in the incomplete information available to us. One plus one is greater than two; the whole is indeed more than the sum of its parts. We love to make causal connections, and we make decisions, judgements and assumptions based on incomplete data on a daily basis. ‘I like this dress, but it won’t suit you’ or ‘He is nice, but not your type’ are peculiar decisions we make every day without having much information about the situation or the person being judged. And yet we find ourselves to be correct in a large part of these everyday decisions. It might seem odd at first, but we do have a lot of practice in making them. Most of us are experts in mundane decision making. This ‘blink of an eye’ moment, which Malcolm Gladwell describes in his book ‘Blink’, is a survival instinct. You hear experts make comments like ‘this doesn’t feel good’ or ‘I don’t like how it looks’ and often find yourself bewildered at their lack of sound reasoning, yet they are right more often than not. Bear in mind that being an ‘expert’ is the key prerequisite for getting away with such vague remarks; I often find myself furious when sound logic is absent from an argument. These experts have accumulated a large body of information over a number of years, have practiced for a long time and have absorbed these decision-making instincts into the core of their personality. They make decisions without knowing why they made them.
I am going to disagree.
Death of the domain expert
The world is extremely messy and it is impossible to be a know-it-all. A true expert is a fictional entity. The large problems that we face today present us with a daunting task, and the human tendency to find patterns falls flat in the age of information. We are presented with more data than was ever available to humankind. At a time when information flows in exponentially from multiple places, we can’t always find patterns easily, at least not based on instinct alone.
The traditional systems of dealing with data involved coming up with a hypothesis and analyzing the data to prove or disprove it. For example, we could hypothesize that ‘only about 5% of data is in a structured format’ and go on to sift through a sample dataset to prove or disprove this claim (FYI, the above hypothesis is true). However, Big Data analysis, in most cases, does not begin with a hypothesis. It is akin to a crystal ball where you ask the dataset a question and let it tell you its story. It will always answer based on the questions asked. We assume nothing when working with Big Data. The stories presented are often in the form of correlations. It is always a probability, never a certainty.
Patterns and correlations really shine in the Big Data world. Google Flu Trends is one such experiment that uses Big Data to predict the outbreak and spread of influenza. It does this by cleverly scanning people’s search queries: the higher the number of people in a locality searching for symptoms related to an illness, the higher the chances of an outbreak there. According to Cukier and Mayer-Schönberger, the system proved its worth during the H1N1 flu pandemic of 2009, when it gave public health officials a near-real-time picture of where the flu was spreading. The system makes a clever correlation between search queries and flu symptoms, thereby predicting the localities that are probably affected. If you can recall, this is similar to what John Snow tried to achieve with his cholera outbreak map.
However, when using data to make predictions, one needs to be wary of mistaking correlation for causation. One study suggests that women are happier than men. On average, women have longer hair than men. A conclusion one could draw from the available data is that having long hair is the reason for happiness, but this is a spurious connection: the two merely occur together. Mistaking correlation for causation almost led the state of Illinois, USA, to send books to every child in the state because studies showed that books in the home environment correlated with higher test scores. Later studies showed that children from homes with many books did better even if they never read them, leading researchers to correct their assumptions: homes where parents buy books tend to be environments where learning is encouraged and rewarded.
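The books-and-test-scores story can be reproduced with a toy simulation. The sketch below is only an illustration: every number in it is invented, and `encourages_learning` is a hypothetical hidden factor standing in for the home environment. In this model books have no direct effect on scores, yet the two still correlate strongly until the hidden factor is held fixed.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_homes(n=5000, seed=0):
    """Simulate homes where a hidden factor drives both books and scores.

    All parameters are made up for illustration. Books never influence
    scores directly in this toy model.
    """
    rng = random.Random(seed)
    homes = []
    for _ in range(n):
        encourages_learning = rng.random() < 0.5  # hidden confounder
        books = rng.gauss(80 if encourages_learning else 20, 10)
        score = rng.gauss(75 if encourages_learning else 60, 8)
        homes.append((encourages_learning, books, score))
    return homes


homes = simulate_homes()

# Correlation across all homes: looks like "books raise scores".
overall = pearson([b for _, b, _ in homes], [s for _, _, s in homes])

# Correlation inside each kind of home: the effect vanishes.
within = {
    flag: pearson(
        [b for f, b, _ in homes if f == flag],
        [s for f, _, s in homes if f == flag],
    )
    for flag in (True, False)
}

print(f"all homes:               {overall:.2f}")
print(f"learning-friendly homes: {within[True]:.2f}")
print(f"other homes:             {within[False]:.2f}")
```

Across all homes the correlation is clearly positive, while within each group it hovers around zero. Handing out books in this toy world would change nothing, which is roughly the mistake Illinois almost made.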
Big Data simply means an extremely large quantity of data. Strong correlations really shine in the Big Data world, even when some of them run directly contrary to prevailing notions.
In 1996, Brendan Reilly took charge of Cook County Hospital’s department of medicine. Cook County is a public hospital with a huge inflow of patients each day. The hospital’s Emergency department (ED) was in special turmoil. The government had done a great job of educating its citizens about myocardial infarction, also known as ‘heart attack’. This resulted in an increased inflow of false alarms to the already strained ED: almost 30 patients a day admitted to the ED were worried they were having a heart attack. In a system with constrained resources, how does one decide who needs what? How do you allocate resources to the patients in real need? Bear in mind that chest pain patients are resource intensive and the treatment period is long. Doctors often rely on a large number of parameters derived from various tests and a series of questions, then make an estimate of the probable ailment. You often find different doctors giving different opinions on the same set of symptoms. A large part of a doctor’s accuracy is a function of experience; maybe that’s why it is called a practice. In an intense environment like the emergency ward, doctors need to make decisions quickly. As human beings faced with life and death situations, doctors tend to lean heavily on the side of caution. Why take the risk of ignoring the chance of a heart attack, even though it might be a false alarm? The problem is that this makes the estimate very inaccurate, which the resource-constrained emergency ward cannot afford. Reilly showed different doctors sets of symptoms and asked them to categorize the patients based on severity. The results were all over the place: some doctors sent the patients to the intensive care unit while others sent them home. The answers pretty much ranged from zero to a hundred. It looked more like guessing, and guessing leads to mistakes.
Reilly turned to the work of the cardiologist Lee Goldman, who had devised an algorithm to identify the presence and severity of a heart attack. In the 1970s, Goldman had worked with a group of mathematicians, and it struck him that some of their statistical principles might be applied to the problem of detecting a heart attack. He analyzed hundreds of cases, fed them into a computer, looked at what kinds of things led to a heart attack and came up with an algorithm. He believed that the equation would take much of the guesswork out of treating chest pain, especially in an emergency situation. The algorithm suggested that doctors combine the evidence of the ECG with a set of urgent risk factors: (a) Is the pain felt by the patient unstable angina (an intense local pain)? (b) Is there fluid in the patient’s lungs? and (c) Is the patient’s systolic blood pressure below 100? Goldman drew a decision tree that recommended the appropriate treatment for each combination of factors. Despite the rigor of his calculations, no one wanted to believe in the algorithm. Life and death situations ought to be complex problems; a simple algorithm seemed too easy.
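To get a feel for how little such a rule needs, here is a minimal sketch of its shape: one ECG reading plus the three risk factors above. The post does not reproduce Goldman’s actual tree, so the mapping from factor combinations to care levels below is invented for illustration, and names such as `ChestPainCase` and `recommend` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChestPainCase:
    ecg_abnormal: bool      # ECG shows evidence of a possible heart attack
    unstable_angina: bool   # risk factor (a)
    fluid_in_lungs: bool    # risk factor (b)
    systolic_bp: int        # mmHg; risk factor (c) is a reading below 100


def count_risk_factors(case):
    """Count how many of the three urgent risk factors are present."""
    return sum([
        case.unstable_angina,
        case.fluid_in_lungs,
        case.systolic_bp < 100,
    ])


def recommend(case):
    """Map ECG evidence plus risk-factor count to a care level.

    ASSUMPTION: these branches are placeholders that only show the
    structure of a decision tree. They are not Goldman's published rule.
    """
    risks = count_risk_factors(case)
    if case.ecg_abnormal and risks >= 2:
        return "intensive care"
    if case.ecg_abnormal or risks >= 2:
        return "intermediate care"
    if risks == 1:
        return "short-stay observation"
    return "lowest-priority observation"


# Two made-up patients: one with every warning sign, one with none.
severe = ChestPainCase(ecg_abnormal=True, unstable_angina=True,
                       fluid_in_lungs=True, systolic_bp=90)
mild = ChestPainCase(ecg_abnormal=False, unstable_angina=False,
                     fluid_in_lungs=False, systolic_bp=125)

print(recommend(severe))
print(recommend(mild))
```

Whatever the real branches look like, the point stands: four inputs and a handful of `if` statements, which is exactly why it felt too easy.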
At Cook County Hospital, Reilly held a bake-off. The staff at the Emergency department would follow the traditional protocol for a few months, then they would use Goldman’s algorithm. The diagnosis and symptoms of every patient treated under the two systems would be compared. This data was collected for two years and, surprisingly, Goldman’s algorithm triumphed. As Gladwell recounts in ‘Blink’, it was 70% better than the traditional method at recognizing patients who were not actually having a heart attack, and at the same time it was safer. The goal of the algorithm, after all, was to make sure that patients suffering from a real heart attack would be assigned the right resources, right away. Goldman’s algorithm flies in the face of the traditional notion that the more information decision makers have, the better off they are, when in fact more information can lead to more confusion. Not every factor correlates strongly with the outcome. Having more information is sometimes not an advantage at all. Often the solution might not explain the cause of a problem, only that the problem exists.
Analysis of Big Data helps us arrive at these simple algorithms by identifying strong correlations between the effect and the variables. Its results sometimes seem too simple for the large, complex problem at hand. Like doctors, many professionals fall into the trap of believing that more information always means better decisions. It feels mundane or too easy to follow a simple algorithm, because anyone can follow an algorithm. It can’t be this simple. It doesn’t feel right.
Statistics and Sesame Street
The notion of Big Data has been around for a long time. The astronomical dataset created by Copernicus in the 16th century was probably the first big dataset made public. We have been using and analyzing large quantities of data for a long time, often in the most unexpected places.
Creating a TV show for children is no minor feat. The making of the TV show Sesame Street was no epiphany. It was deliberately engineered; ‘engineered’ may seem an inapt word, yet that is precisely what it was. It was designed around a single breakthrough insight: that if you can hold children’s attention, you can educate them. Most of us adults watch TV when we feel stimulated and change channels or look away when we get bored. This is not how preschoolers watch TV. They do not zone out in front of a screen; rather, they constantly shift their gaze between what makes sense on the screen and the surrounding environment. In one experiment, kids were shown a re-edited episode in which certain scenes were out of order. It didn’t make any sense, and the kids stopped watching. In a controlled experiment, one group of children was shown an episode in a room full of toys while the other was shown the same episode in an empty room. As expected, the first group watched the show far less. However, the goal was not to measure how long the kids stayed engaged but how much they learned. When tested, the two groups had the exact same scores. This was a surprising insight for the researchers: kids watched what they understood and looked away when they got confused.
Ed Palmer was a psychologist and the head of research at Sesame Street. With expertise in the use of television as a teaching tool, Palmer approached the design of the episodes statistically. He used a technique he called the ‘Distractor’, in which a group of children were shown an episode of Sesame Street on one screen while a second screen ran a slideshow of random images. The slides changed every seven and a half seconds. Palmer and his associates sat at the back of the room, noting when the children lost interest and looked at the slideshow. By the end of the study, Palmer’s team had second-by-second engagement data for the entire episode. They plotted the data on a large sheet of paper as a line graph of engagement over the length of the episode. They noted which parts worked and which did not, often reshooting entire scenes to reach the target engagement level of 85–90%.
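Palmer’s line graph is easy to imagine as a small calculation. The sketch below assumes one observation per child per slide interval (True means the child was watching the show, False means the slideshow) and uses the lower end of the 85–90% range as the threshold. The data layout, the function names and the sample observations are all made up for illustration; they are not Palmer’s records.

```python
SLIDE_SECONDS = 7.5   # the slides changed every seven and a half seconds
TARGET = 0.85         # lower end of the 85-90% range mentioned above


def engagement_by_interval(observations):
    """Share of children watching the show during each slide interval.

    observations: one list per child, one boolean per interval.
    """
    n_children = len(observations)
    n_intervals = len(observations[0])
    return [
        sum(child[i] for child in observations) / n_children
        for i in range(n_intervals)
    ]


def weak_segments(shares, target=TARGET):
    """Return (start_second, end_second, share) for intervals below target."""
    return [
        (i * SLIDE_SECONDS, (i + 1) * SLIDE_SECONDS, share)
        for i, share in enumerate(shares)
        if share < target
    ]


# Four hypothetical children observed over four slide intervals.
observations = [
    [True, True, False, True],
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]

shares = engagement_by_interval(observations)
for start, end, share in weak_segments(shares):
    print(f"{start:5.1f}s to {end:5.1f}s: {share:.0%} watching")
```

In this toy run the stretch from 7.5 to 22.5 seconds falls below the threshold, so those are the scenes a team working this way would look at reshooting.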
Although there is no denying that the Big Data phenomenon has a dark side that hints at an Orwellian future, we also need to bear in mind that it has a number of positives that can push humanity in the right direction. Big Data is a resource, but unlike physical resources, the value of information doesn’t diminish with use; often its use increases the underlying value.
I began this post by arguing against the ‘gut’ or ‘instinct’. However, I truly believe that what is valuable about human beings is precisely this intuitive nature, which cannot be revealed by silicon chips and algorithms. We, too, are more than the sum of our parts. It is what is yet to come that is exciting.
As the Buddha said, we need to find a middle ground. There is a place for instinct and data to lie side by side. The gut is just as important: it is the first step, but not the last one.
- Blink by Malcolm Gladwell
- The Tipping Point by Malcolm Gladwell
- Big Data by Kenneth Cukier and Viktor Mayer-Schönberger
Thank you Kanika for the last minute edits :)