Three weeks into my journey to become a data scientist and I’ve officially been baptized… by fire, that is! I chose to attend Data Science 15-week bootcamp to transition out of finance. So far, the program has exceeded expectations (and my expectations were high). Flatiron’s While the curriculum is rigorous and fast-paced, it’s well constructed, the instructors are dedicated to helping students learn, and my cohort is amazing — everyone is friendly, helpful, smart, and undoubtedly will go on to accomplish great things. This series of blog posts is dedicated to them… Here’s to the future data scientists! I look forward to reading my cohort-mates’ posts as well! We cover a ton of information in class, so when I discover something helpful from outside of class that sharpens my perspective or skills, I’ll use these next several posts to share it. More broadly, this series is for anyone who is interested in data science. When studying data science, one of the first realizations is that data cleaning is time consuming (and many find it the least enjoyable aspect of the job). Estimates show a range of ¹ of time is spent on cleaning and organizing data, with a much smaller portion spent on analysis (the fun part). 60–80 percent With this statistic as a backdrop, I figured I’d share a few useful techniques and libraries that I’ve discovered to make the data cleaning process easier. Specifically, I’ll focus on how to deal with missing data. I’m a visual learner and I like frameworks, so the objective of this post is to provide a basic process for identifying and treating missing data by: — standard vocabulary for classifying missingness (that has implications for treatment of the missing data), and introducing the “missing data mechanism” demonstrating the usefulness of the Python missingno library. Some background Information… For this analysis, I used the Titanic dataset because it’s readily accessible (you can download it ) and it does not require domain knowledge to understand the data (unlike healthcare data, for example). I did not break it up into training and test sets as the goal here is just to cover (and uncover) issues with missing data. Finally, all code is Python. here Step 1: Do some Exploratory Data Analysis (EDA) EDA is critical to understanding the features of the dataset and (particularly relevant to the current topic) uncovering any hidden missing values that might be lurking. You can slice the Titanic data in myriad ways, but here’s some charts I made that should suffice for our purposes. The first row segments the population by sex. As you can see, males outnumber females by about a 65/35 proportion. However, when it comes to survival, females dominate. Close to 75 percent of females survive, compared to under 20 percent of males. For anyone who’s seen the Titanic, this comes as no surprise — women and children boarded lifeboats first. In aggregate, median age for both sexes is pretty close (27 years for females, 29 years for males). The second row is where things get a little more interesting. We see that most people on the ship are 3rd class (in other words relatively poor). It also comes as no surprise that the majority of casualties belong to the 3rd class; the poor boarded lifeboats last, if at all. An additional insight is that the median age varies substantially by class. Finally, there’s a mysterious question mark on all the second row charts — it’s a null value masquerading as a real value! We’ll have to convert the “?” into a null/missing value and decide how to deal with it, and we know that passenger class and sex are likely going to factor into any imputation we may need to do. From EDA we have already gained some key insights that will carry over into our analysis of the missing data. Step 2: Identify the type(s) of missingness in the data The Missing Data Mechanism Identifying the missingness type helps narrow down the methodologies you can use for treating missing data. The Missing Data Mechanism is the standard framework. In it, there are three types of missingness: The missing values have no correlation with other values in the dataset observed or missing. “There is systematic going on that makes some data more likely to be missing than others ( )².” For example, if a few students in a class happen to be absent one day for their own individual reasons, that’s MCAR. “There is a systematic relationship between the propensity of missing values and the data, but the missing data².” Moreover, there’s an underlying reason for missingness that can’t be directly observed. Continuing with our classroom example, it’s flu season and a few students are out sick. There is a relationship between missingness and its values, missing or non-missing². Missingness of a certain value depends on the true value itself. Extending the prior example, lets say there’s a statewide standardized testing day and a number of students are absent. Perhaps the students who feel insecure about standardized testing are most likely to be absent. Whether the missing data are missing because of the actual values is a hard question to answer. In our classroom example, the question would be whether students most likely to perform poorly on the test are missing. Missing Completely at Random (MCAR). nothing Analysis Factor Missing at Random (MAR). observed not Missing Not at Random (MNAR). There are some tests to diagnose the type of missing data, like for MCAR, but the only true way to distinguish between MAR and MNAR is to measure some of the missing data². However, this just isn’t possible in most cases, so Fortunately, there is a library that simplifies missingness pattern assessment. Little’s Test classifying the type of missingness really comes down to analyzing patterns in the missing data and relying on domain expertise to guide decision-making. The first chart to look at is the . In the Titanic dataset, after we replace the “?” values with null values, the matrix shows there are four columns with missing data (Passenger Class, Age, Cabin assignment, and port the passenger Embarked). You can sort on any column in the feature set to observe potential relationships between missingness. There are missing values in Pclass and Age, but they do not appear correlated with other missing values. This is a typical case of MAR, as there might be a reason for the missing values that cannot be directly observed. In this case, it’s probably pretty hard to preserve records after a tragedy. A great way to explore the missing data is to visualize it with missing no. nullity matrix³ Another great visualization is the bar chart. It gives more detail on the degree of missingness in each feature. We see that Pclass is around 5 percent missing, Age is 20 percent missing, while Cabin is mostly missing. The last two charts allow us to explore correlation further (i.e. the dependencies of missingness between columns). The heatmap function shows that there are no strong correlations between missing values of different features. This is good; low correlations further indicate that the data are MAR. The dendrogram function creates a tree diagram of missingness. It describes the correlations in missingness by grouping similarly missing columns together. Features at zero, fully predict one another’s presence. One variable might always be empty while another is filled, or they might always both be filled, or both empty (missingno documentation)³. Looking at the Titanic dataset, the missingness of Embarked tends to be more similar to Pclass than to Age and so on. Step 3: Treat the missing data Now that we’ve identified the missingness in the data, let’s discuss a few primary means of dealing with it. PROS: simple; CONS: assumes MCAR and introduces bias if not MCAR. There are three types highlighted below, but you can find a more thorough discussion at ⁴. Deletion. Statistic Solutions (complete case analysis) removes all data for a record that has one or more missing values⁴. (available case analysis) makes use of a record’s features that aren’t missing. For example, a correlation measures the strength between two variables. For each pair of variables for which data is available, the correlation coefficient will take that data into account. Thus, pairwise deletion maximizes all data available on an analysis by analysis basis⁴. is performed when the majority of data for the feature are missing values. Listwise Deletion Pairwise Deletion Column Deletion Single Imputation. is done by replacing missing values with a measure of central tendency. PROS: relatively simple; CONS: reduces variance is a technique where missing values are matched to resembling observations with non-missing values and the missing value is imputed with the value of the similar non-missing observation⁵. Two approaches are nearest neighbors and the matching pattern method. PROS: more educated; CONS— more computationally expensive and time consumingIn , values are predicted from a regression equation. Observations with complete features are used to predict the values of the missing data. Because single regression assumes the missing values fall on the regression line, it reduces variance⁵. aims to reduce bias by augmenting each missing value’s predicted value with a residual term. In this manner, the variance is preserved and estimates are unbiased with MAR data⁵. Mean/Median Imputation Hot Deck Method Single Regression Imputation Stochastic Regression PROS: unbiased; CONS: complex Multiple Imputation. , is among the more popular multiple imputation approaches. Multiple imputation has a number of advantages over other missing data approaches. Creating multiple imputations, as opposed to single imputations, accounts for the statistical uncertainty in the imputations. In addition, the chained equations approach is very flexible and can handle variables of varying types (e.g., continuous or binary). Multiple imputation involves filling in the missing values multiple times, creating multiple “complete” datasets⁶. Multivariate imputation by chained equations (MICE) Sensitivity Analysis Best-worst and worst-best case sensitivity analyses can be used to show the range of uncertainty due to missing data. Sensitivity analyses may show how assumptions, different from those made in the primary analysis influence the results obtained⁷. See this from BMC Medical Research for further explanation. When your data is MNAR, this is about the best solution. Otherwise, collect better data. paper Conclusion In this article, I have provided a useful framework and library for classifying and visualizing missing data. Over the remainder of this series, I will apply some of the imputation methods mentioned above on the missing Titanic data and explore the effects of each method. References [1] G. Press, (2016), Forbes Cleaning Big Data: Most Time-Consuming, Least Enjoyable Data Science Task, Survey Says [2] K. Grace-Martin, (2019), The Analysis Factor How to Diagnose the Missing Data Mechanism [3] Bilogur, (2018), Journal of Open Source Software, 3(22), 547 Missingno: a missing data visualization suite [4] Statistics Solutions, , Statistics Solutions blog Missing Data: Listwise vs Pairwise [5] I. Eekhout, , Iris Eekhout blog Single Imputation Methods [6] M. Azur, E. Stuart, C. Frangakis, P. Leaf, (2011), International Journal of Methods in Psychiatric Research Multiple Imputation by Chained Equations: What is it and how does it work? [7] J. Jakobsen, C. Gluud, J. Wetterslev, P. Winkel, (2017), BMC Med Res Methodol 17, 162 When and how should multiple imputation be used for handling missing data in randomised clinical trials — a practical guide with flowcharts

Different

Apple Mail - Unmasking The Most Mysterious App In The World

A Framework & Package for Missing Data To Speed Up Data Organization and Data Cleaning

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Apple Mail - Unmasking The Most Mysterious App In The World

10 Ways to Optimize Your Database

10 Best React Native Chart Libraries

10 Best Datasets for Time Series Analysis

12 Mistakes that Data Scientists Make and How to Avoid Them

13 Best Datasets for Power BI Practice

Apple Mail - Unmasking The Most Mysterious App In The World

10 Ways to Optimize Your Database

10 Best React Native Chart Libraries

10 Best Datasets for Time Series Analysis

12 Mistakes that Data Scientists Make and How to Avoid Them

13 Best Datasets for Power BI Practice

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps