Three weeks into my journey to become a data scientist and I’ve officially been baptized… by fire, that is! I chose to attend Flatiron’s Data Science 15-week bootcamp to transition out of finance. So far, the program has exceeded expectations (and my expectations were high).
While the curriculum is rigorous and fast-paced, it’s well constructed, the instructors are dedicated to helping students learn, and my cohort is amazing — everyone is friendly, helpful, smart, and undoubtedly will go on to accomplish great things.
This series of blog posts is dedicated to them… Here’s to the future data scientists!
We cover a ton of information in class, so when I discover something helpful from outside of class that sharpens my perspective or skills, I’ll use these next several posts to share it. I look forward to reading my cohort-mates’ posts as well! More broadly, this series is for anyone who is interested in data science.
When studying data science, one of the first realizations is that data cleaning is time-consuming (and many find it the least enjoyable aspect of the job).
Estimates suggest that anywhere from 60 to 80 percent¹ of time is spent on cleaning and organizing data, with a much smaller portion spent on analysis (the fun part).
With this statistic as a backdrop, I figured I’d share a few useful techniques and libraries that I’ve discovered to make the data cleaning process easier.
Specifically, I’ll focus on how to deal with missing data. I’m a visual learner and I like frameworks, so the objective of this post is to provide a basic process for identifying and treating missing data: performing exploratory data analysis (EDA) to surface hidden missing values, classifying the type of missingness, visualizing the missing data, and reviewing treatment options (deletion, imputation, and sensitivity analysis).
Some background information… For this analysis, I used the Titanic dataset because it’s readily accessible (you can download it here) and it does not require domain knowledge to understand the data (unlike healthcare data, for example). I did not break it up into training and test sets, as the goal here is just to cover (and uncover) issues with missing data. Finally, all code is Python.
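To set the stage, here’s a minimal loading sketch, assuming the file is saved locally as titanic.csv (the file name, and the column names used later, are assumptions; adjust them to match the copy you downloaded):

```python
import pandas as pd

# Load the Titanic data (file name is an assumption; adjust to your download)
df = pd.read_csv('titanic.csv')

# Quick first look at the shape, features, and data types
print(df.shape)
print(df.info())
df.head()
```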
EDA is critical to understanding the features of the dataset and (particularly relevant to the current topic) uncovering any hidden missing values that might be lurking. You can slice the Titanic data in myriad ways, but here are some charts I made that should suffice for our purposes.
The first row segments the population by sex. As you can see, males outnumber females by about a 65/35 proportion. However, when it comes to survival, females dominate: close to 75 percent of females survive, compared to under 20 percent of males. For anyone who’s seen Titanic, this comes as no surprise, since women and children boarded lifeboats first. In aggregate, the median age for both sexes is pretty close (27 years for females, 29 years for males).
The second row is where things get a little more interesting. We see that most people on the ship are 3rd class (in other words, relatively poor). It also comes as no surprise that the majority of casualties belong to the 3rd class; the poor boarded lifeboats last, if at all. An additional insight is that the median age varies substantially by class. Finally, there’s a mysterious question mark on all the second-row charts: it’s a null value masquerading as a real value!
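If you want to reproduce rough versions of these cuts yourself, a minimal sketch with pandas and matplotlib might look like the following (column names such as Sex, Survived, Pclass, and Age are assumptions based on the standard Titanic schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Survival rate by sex and by class (Survived is assumed to be coded 0/1)
print(df.groupby('Sex')['Survived'].mean())
print(df.groupby('Pclass')['Survived'].mean())

# Age may be read in as text if it contains the "?" placeholder,
# so coerce it to numeric before taking medians
age = pd.to_numeric(df['Age'], errors='coerce')
print(age.groupby(df['Sex']).median())
print(age.groupby(df['Pclass']).median())

# Simple bar charts mirroring the two rows of charts described above
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['Sex'].value_counts().plot(kind='bar', ax=axes[0], title='Passengers by sex')
df['Pclass'].value_counts().plot(kind='bar', ax=axes[1], title='Passengers by class')
plt.tight_layout()
plt.show()
```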
From EDA we have already gained some key insights that will carry over into our analysis of the missing data. We’ll have to convert the “?” into a null/missing value and decide how to deal with it, and we know that passenger class and sex are likely going to factor into any imputation we may need to do.
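The conversion itself is straightforward in pandas; the re-coercion afterwards is needed because a column that contained “?” gets read in as text (column names are assumptions):

```python
import pandas as pd
import numpy as np

# Replace the "?" placeholder with a proper missing value across the whole frame
df = df.replace('?', np.nan)

# Re-coerce columns that should be numeric but were read as text
# because of the placeholder
df['Age'] = pd.to_numeric(df['Age'], errors='coerce')
df['Pclass'] = pd.to_numeric(df['Pclass'], errors='coerce')

# Confirm the nulls are now visible
print(df.isnull().sum())
```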
Identifying the missingness type helps narrow down the methodologies you can use for treating missing data. The Missing Data Mechanism is the standard framework. In it, there are three types of missingness:
Missing Completely at Random (MCAR). The missing values have no correlation with other values in the dataset, observed or missing. “There is nothing systematic going on that makes some data more likely to be missing than others (Analysis Factor)².” For example, if a few students in a class happen to be absent one day for their own individual reasons, that’s MCAR.

Missing at Random (MAR). “There is a systematic relationship between the propensity of missing values and the observed data, but not the missing data².” Moreover, there’s an underlying reason for the missingness that can’t be directly observed. Continuing with our classroom example, it’s flu season and a few students are out sick.

Missing Not at Random (MNAR). There is a relationship between missingness and its values, missing or non-missing²; whether a value is missing depends on the true value itself. Extending the prior example, let’s say there’s a statewide standardized testing day and a number of students are absent. Perhaps the students who feel insecure about standardized testing are the most likely to be absent. Whether data are missing because of their actual values is a hard question to answer; in our classroom example, the question would be whether the students most likely to perform poorly on the test are the ones who are missing.
There are some tests to diagnose the type of missing data, like Little’s Test for MCAR, but the only true way to distinguish between MAR and MNAR is to measure some of the missing data². However, this just isn’t possible in most cases, so classifying the type of missingness really comes down to analyzing patterns in the missing data and relying on domain expertise to guide decision-making. Fortunately, there is a library that simplifies missingness pattern assessment.
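Short of a formal test, one simple way to look for such patterns is to compare the observed features of records with and without a missing value. Here’s a rough sketch using a missing Age as the example (not a formal diagnostic):

```python
# Compare survival rate and class mix for rows with and without a missing Age
missing_age = df['Age'].isnull()
print(df.groupby(missing_age)['Survived'].mean())
print(df.groupby(missing_age)['Pclass'].value_counts(normalize=True))
```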
A great way to explore the missing data is to visualize it with missingno. The first chart to look at is the nullity matrix³. In the Titanic dataset, after we replace the “?” values with nulls, the matrix shows there are four columns with missing data (passenger class, age, cabin assignment, and the port where the passenger embarked). You can sort on any column in the feature set to observe potential relationships in missingness. There are missing values in Pclass and Age, but they do not appear correlated with other missing values. This is a typical case of MAR, as there might be a reason for the missing values that cannot be directly observed; in this case, it was probably pretty hard to preserve records after a tragedy.
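Assuming the library is installed (pip install missingno), the matrix is a single call; the sorted version is just one way to surface relationships:

```python
import missingno as msno

# Nullity matrix: each white gap is a missing value in that column
msno.matrix(df)

# Sorting by a column (Age here) can make co-occurring gaps easier to spot
# (in a plain script, follow these calls with plt.show())
msno.matrix(df.sort_values('Age'))
```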
Another great visualization is the bar chart. It gives more detail on the degree of missingness in each feature. We see that Pclass is around 5 percent missing, Age is 20 percent missing, while Cabin is mostly missing.
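The bar chart is similarly a single call:

```python
# Bar chart of non-null counts per column: a quick read on the proportion
# missing in each feature
msno.bar(df)
```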
The last two charts allow us to explore correlation further (i.e. the dependencies of missingness between columns). The heatmap function shows that there are no strong correlations between missing values of different features. This is good; low correlations further indicate that the data are MAR.
The dendrogram function creates a tree diagram of missingness. It describes the correlations in missingness by grouping similarly missing columns together. Features linked at a height of zero fully predict one another’s presence: one variable might always be empty when another is filled, or they might always be both filled or both empty (missingno documentation)³. Looking at the Titanic dataset, the missingness of Embarked tends to be more similar to that of Pclass than to Age, and so on.
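Both of these also come straight from the library:

```python
# Nullity correlation heatmap: how strongly the presence or absence of one
# feature's values predicts another's
msno.heatmap(df)

# Dendrogram: hierarchical clustering of the columns by nullity correlation
msno.dendrogram(df)
```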
Now that we’ve identified the missingness in the data, let’s discuss a few primary means of dealing with it.
Deletion. PROS: simple; CONS: assumes MCAR and introduces bias if the data are not MCAR. There are three types highlighted below, but you can find a more thorough discussion at Statistics Solutions⁴.
Listwise Deletion (complete case analysis) removes all data for a record that has one or more missing values⁴.

Pairwise Deletion (available case analysis) makes use of whatever features of a record aren’t missing. For example, a correlation coefficient measures the strength of the relationship between two variables; for each pair of variables with available data, the coefficient takes that data into account. Thus, pairwise deletion maximizes the data available on an analysis-by-analysis basis⁴.

Column Deletion is performed when the majority of values for a feature are missing.
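Here’s a rough sketch of what each flavor looks like in pandas (the column choices, including Fare, are illustrative assumptions, not recommendations):

```python
import pandas as pd

# Listwise deletion: drop any row with at least one missing value
listwise = df.dropna()

# Pairwise deletion: pandas' correlation already uses all available pairs,
# so each coefficient is computed on the rows where both columns are present
numeric = df[['Age', 'Fare', 'Pclass']].apply(pd.to_numeric, errors='coerce')
pairwise_corr = numeric.corr()

# Column deletion: drop a feature that is mostly missing (Cabin here)
no_cabin = df.drop(columns=['Cabin'])
```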
Single Imputation.
Mean/Median Imputation replaces missing values with a measure of central tendency. PROS: relatively simple; CONS: reduces variance.

The Hot Deck Method matches each observation with a missing value to a resembling observation with non-missing values, and imputes the missing value with that similar observation’s value⁵. Two approaches are nearest neighbors and the matching pattern method. PROS: a more educated guess; CONS: more computationally expensive and time consuming.

In Single Regression Imputation, values are predicted from a regression equation: observations with complete features are used to predict the values of the missing data. Because single regression assumes the missing values fall on the regression line, it reduces variance⁵.

Stochastic Regression aims to reduce bias by augmenting each missing value’s predicted value with a residual term. In this manner, variance is preserved and estimates are unbiased with MAR data⁵.
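As a sketch of the first two ideas using scikit-learn: SimpleImputer covers mean/median imputation and KNNImputer is a nearest-neighbors flavor of hot deck (regression-based imputation shows up in the MICE sketch further down). The column names are assumptions:

```python
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Numeric features to impute; coerce to numeric in case the "?" placeholder
# left any of them typed as text
num_cols = ['Age', 'Fare', 'Pclass']
numeric = df[num_cols].apply(pd.to_numeric, errors='coerce')

# Median imputation
df_median = df.copy()
df_median[num_cols] = SimpleImputer(strategy='median').fit_transform(numeric)

# Nearest-neighbors imputation, a hot-deck-style approach: each missing value
# is filled in from the rows that look most similar on the other features
df_knn = df.copy()
df_knn[num_cols] = KNNImputer(n_neighbors=5).fit_transform(numeric)
```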
Multiple Imputation. PROS: unbiased; CONS: complex
Multivariate imputation by chained equations (MICE) is among the more popular multiple imputation approaches. Multiple imputation has a number of advantages over other missing data approaches. Creating multiple imputations, as opposed to single imputations, accounts for the statistical uncertainty in the imputations. In addition, the chained equations approach is very flexible and can handle variables of varying types (e.g., continuous or binary). Multiple imputation involves filling in the missing values multiple times, creating multiple “complete” datasets⁶.
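scikit-learn’s IterativeImputer is modeled on MICE; re-running it with sample_posterior=True and different seeds is one reasonable way to produce several imputed datasets. This is a sketch under those assumptions, not the canonical MICE implementation:

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

num_cols = ['Age', 'Fare', 'Pclass']  # assumed numeric features
numeric = df[num_cols].apply(pd.to_numeric, errors='coerce')

# Re-run the chained-equations-style imputation with different random seeds,
# sampling from the posterior so each run yields a different "complete" dataset
imputed_sets = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    filled = df.copy()
    filled[num_cols] = imputer.fit_transform(numeric)
    imputed_sets.append(filled)

# Downstream, you would fit your model on each dataset and pool the results
```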
Sensitivity Analysis
Best-worst and worst-best case sensitivity analyses can be used to show the range of uncertainty due to missing data. Sensitivity analyses may show how assumptions different from those made in the primary analysis influence the results obtained⁷. See this paper from BMC Medical Research Methodology for further explanation. When your data are MNAR, this is about the best solution available; otherwise, collect better data.
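As a toy illustration on the Titanic data, assuming Age is the variable of interest, you could bound a statistic by filling the missing ages with the most extreme plausible values:

```python
import pandas as pd

# Best case: assume every missing age is the youngest observed age;
# worst case: assume it is the oldest
age = pd.to_numeric(df['Age'], errors='coerce')
best_case = age.fillna(age.min())
worst_case = age.fillna(age.max())

print('Mean age, best case:', round(best_case.mean(), 1))
print('Mean age, worst case:', round(worst_case.mean(), 1))
# A wide gap between these bounds signals that any conclusion depending on Age
# is sensitive to how its missing values are handled
```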
In this article, I have provided a useful framework and library for classifying and visualizing missing data. Over the remainder of this series, I will apply some of the imputation methods mentioned above on the missing Titanic data and explore the effects of each method.
[1] G. Press, Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says (2016), Forbes
[2] K. Grace-Martin, How to Diagnose the Missing Data Mechanism (2019), The Analysis Factor
[3] A. Bilogur, Missingno: a missing data visualization suite (2018), Journal of Open Source Software, 3(22), 547
[4] Statistics Solutions, Missing Data: Listwise vs Pairwise, Statistics Solutions blog
[5] I. Eekhout, Single Imputation Methods, Iris Eekhout blog
[6] M. Azur, E. Stuart, C. Frangakis, P. Leaf, Multiple Imputation by Chained Equations: What is it and how does it work? (2011), International Journal of Methods in Psychiatric Research
[7] J. Jakobsen, C. Gluud, J. Wetterslev, P. Winkel, When and how should multiple imputation be used for handling missing data in randomised clinical trials — a practical guide with flowcharts (2017), BMC Med Res Methodol 17, 162