Imagine: you are about to sit down with a freshly fetched dataset, excited about the insights it will bring, and then you find out it is of no use. If you’ve been there, you know exactly what an untidy dataset is.
A statistician from New Zealand once said: “Tidy datasets are all alike, but every messy dataset is messy in its own way.” Indeed, data comes in so many forms and shapes that we are sometimes inundated with it.
As a result, a data science team can lose sight of its goals and grow disillusioned by mountains of unworkable data. The only way data specialists can facilitate analysis is by keeping data clean and organized.
Essentially, tidy data is a term coined by Hadley Wickham in his Tidy Data paper (remember that statistician from NZ?). He defined tidy data as data that is neatly organized and all set for analysis.
This way of organizing allows you to easily produce charts, diagrams, and summary statistics. As often happens, not all data comes out of the database clean, so cleaning it is essential for efficient analysis.
Without further ado, let us break down the principles that allow you to keep your data nice and clean.
We’ll start with one of the basic principles. When you are giving your data the once-over, you should make sure each row contains an observation. By definition, an observation is the individual unit under study.
If we look at the table above, the observational unit could be called ‘people’. You can see that each person has an individual row in the table and all of the information for that person is in the same row. Observations are stored in rows, variables are represented as columns, and there is only one observational unit per table. Now THIS is tidy data.
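As a minimal sketch of this layout in plain Python, the snippet below stores one person per row and one variable per key. The people and their values are invented for illustration, and the `name` column is an assumption; only `age`, `hair_color` and `height` come from the table discussed here. The helpers `is_rectangular` and `column` are hypothetical names, not part of any library.

```python
# Hypothetical sample data: names and values are made up for illustration.
people = [
    {"name": "Ana", "age": 34, "hair_color": "brown", "height": 170},
    {"name": "Ben", "age": 27, "hair_color": "black", "height": 182},
    {"name": "Cleo", "age": 45, "hair_color": "red", "height": 165},
]


def is_rectangular(rows):
    """Return True when every row (observation) has the same columns (variables)."""
    if not rows:
        return True
    columns = set(rows[0])
    return all(set(row) == columns for row in rows)


def column(rows, name):
    """Collect one variable across all observations."""
    return [row[name] for row in rows]


print(is_rectangular(people))
print(column(people, "age"))
```

Because every row describes exactly one person and every key is one variable, pulling out a single column for a chart or a summary statistic is a one-liner.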
A variable is an attribute you are measuring. Again, if we turn to our table above, age, hair_color and height are all variables.
In tidy data, each variable is represented in a separate column.
Okay, now a one-second quiz: What is wrong with this dataset?
Yep, you guessed right. Never put multiple variables in one column; otherwise, your data analysis is doomed.
If you have grasped the first two principles, this one should already be a no-brainer.
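To make the fix concrete, here is a rough sketch of splitting a column that packs two variables into one. The messy `age_hair` column, its `_` separator and the sample rows are all assumptions made up for this example; they are not taken from the dataset in the quiz.

```python
# Hypothetical messy rows: the "age_hair" column packs two variables into one cell.
messy = [
    {"name": "Ana", "age_hair": "34_brown"},
    {"name": "Ben", "age_hair": "27_black"},
]


def split_age_hair(rows):
    """Return new rows with age and hair_color stored in separate columns."""
    tidy_rows = []
    for row in rows:
        # Split once on the separator so each variable gets its own column.
        age_text, hair_color = row["age_hair"].split("_", 1)
        tidy_rows.append(
            {"name": row["name"], "age": int(age_text), "hair_color": hair_color}
        )
    return tidy_rows


tidy = split_age_hair(messy)
print(tidy)
```

Once the variables live in their own columns, age can be treated as a number and hair color as a category, which is not possible while they share a cell.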
Anyway, we’ll make an extra effort to lay it all out. Each cell should contain only one value. It is also important that all values in the same column are formatted the same way.
In this dataset, you can see that we have a table with four variables and three observations.
Each cell contains one piece of information and our values all match. All of our age values are digits, hair color values are whole words – you get the idea. Therefore, this dataset is tidy and almost fit for analysis.
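One way to check that the values in a column are formatted consistently is to look at the types stored in it. The sketch below is only an illustration: `column_types` and `inconsistent_columns` are hypothetical helper names, and the sample rows are invented, with one age deliberately written as text.

```python
def column_types(rows):
    """Map each column name to the set of Python type names found in it."""
    types = {}
    for row in rows:
        for name, value in row.items():
            types.setdefault(name, set()).add(type(value).__name__)
    return types


def inconsistent_columns(rows):
    """Return a sorted list of columns whose values do not all share one type."""
    return sorted(
        name for name, seen in column_types(rows).items() if len(seen) > 1
    )


# Hypothetical rows: the second age is stored as text instead of a number.
rows = [
    {"age": 34, "hair_color": "brown"},
    {"age": "twenty-seven", "hair_color": "black"},
]
print(inconsistent_columns(rows))
```

A type check like this only catches the coarsest mismatches. Differences such as "Brown" versus "brown" would need a separate, column-specific rule.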
In an ideal dataset, columns should have specific and descriptive names. Let us show you an example of this principle.
The third column is labeled hair_color. This is a more specific heading than if we had simply called it hair.
The word ‘hair’ can refer to anything from hair length to hairstyle. This level of specificity will help you speed up the analysis process.
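If a dataset arrives with vague headings, renaming them up front is cheap. The snippet below is a small sketch under assumptions: the `RENAMES` mapping, the `rename_columns` helper and the sample row are hypothetical and only illustrate the hair to hair_color example.

```python
# Hypothetical mapping from vague headings to more specific ones.
RENAMES = {"hair": "hair_color"}


def rename_columns(rows, renames):
    """Return new rows with columns renamed per the mapping; other columns are kept."""
    return [
        {renames.get(name, name): value for name, value in row.items()}
        for row in rows
    ]


rows = [{"name": "Ana", "hair": "brown"}]
print(rename_columns(rows, RENAMES))
```

Keeping the mapping in one place also documents what each vague heading was taken to mean.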