People often ask me “How can I learn Data Analytics?”, and I often come across the question “How do I become a Data Analyst?” on Quora too. The answer is readily available all over the internet. The real issue is not how to become a data analyst but whether we are ready to become one.
This post aims to take a newbie into the world of Data Analytics using a simple, freely available public dataset and R (the open-source champ of Data Science).
Data collection:
The data used here is the public leaderboard of a recent Analytics Vidhya hackathon. Just copy the table data, paste it into MS Excel and save it in CSV format as ‘av-hackathon.csv’. Your data for analysis is ready.
Reading Input Data:
Let’s read the CSV into RStudio using read.csv.
mind_lb <- read.csv("av-hackathon.csv", header = TRUE, stringsAsFactors = FALSE)
Basic Summary:
Understanding the dataset is the first thing any analyst should do. We can use str() or summary() to get a basic summary of our dataset and to see sample values.
We can clearly see that there are 1037 observations (rows/entries) and 3 variables/columns, along with their data types: two are numeric (one is just a serial number and the other is Score), and Name is of character type.
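To try these functions without the leaderboard file, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back. The rows, the temporary file and the column name Sno are assumptions for illustration; only Name and Score are named in this post.

```r
# Made-up stand-in for the leaderboard (not real hackathon data)
toy <- data.frame(
  Sno   = 1:4,  # hypothetical name for the serial-number column
  Name  = c("alice", "bob@gmail.com", "Sam", "carol@example.org"),
  Score = c(0.91, 0.88, 0.85, 0.79),
  stringsAsFactors = FALSE
)

# Write it to a temporary CSV, then read it back the same way as above
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
toy_lb <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)

str(toy_lb)      # dimensions, column types and a few sample values
summary(toy_lb)  # min, quartiles, mean and max for the numeric columns
```

str() is the quicker way to check types and row counts, while summary() is more useful for spotting odd ranges in a numeric column such as Score.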
Data Analysis:
When we scroll through our dataset (the input file we just read), we can see some users who have their email ID as their username. Can we find out how many users have ‘@’ in their username?
Let’s use grepl (regular expression matching) to find the names containing the @ symbol.
Using grepl("@", mind_lb$Name) returns just TRUE/FALSE for each observation, but what we actually need is a count. So let’s use the table() function in R to find it.
table(grepl("@", mind_lb$Name)) returns the actual counts (absolute figures) of usernames with and without @. But wouldn’t it be better to express this as a percentage?
prop.table() applied to the table() of grepl("@", mind_lb$Name) gives the shares as decimals, which multiplied by 100 give the percentage of usernames with and without @.
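The three steps above can be sketched on a handful of made-up usernames (toy_names is an assumption for illustration, not leaderboard data), which makes it easy to check each intermediate result by eye:

```r
# Made-up usernames (not from the leaderboard)
toy_names <- c("alice", "bob@gmail.com", "Sam", "carol@example.org", "dave")

has_at <- grepl("@", toy_names)      # one TRUE/FALSE per username
counts <- table(has_at)              # absolute counts of FALSE and TRUE
pct    <- prop.table(counts) * 100   # the same counts as percentages

print(counts)
print(round(pct, 1))
```

Two of the five toy names contain @, so the TRUE share comes out at 40%.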
Now we know that almost 6.5% of usernames contain @. Gmail presumably contributes most of them, but can we find out whether there’s something else apart from Gmail?
Let’s perform a logical operation. We want all usernames with @ but without gmail. Rewriting that to make the logic explicit: @ AND NOT gmail. Voila! Here’s the code for it:
mind_lb$Name[grepl("@", mind_lb$Name) & !grepl("gmail", mind_lb$Name)]
which returns the usernames that contain @ but not gmail.
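Here is the same @ AND NOT gmail filter as a small sketch on made-up names. The vector toy_names is an assumption, and ignore.case = TRUE is an addition that the original one-liner does not use; it is there only so that a name like "dave@GMAIL.com" is also treated as Gmail.

```r
# Made-up usernames (not from the leaderboard)
toy_names <- c("alice", "bob@gmail.com", "Sam",
               "carol@example.org", "dave@GMAIL.com")

has_at   <- grepl("@", toy_names)
# Assumption: match "gmail" regardless of case
is_gmail <- grepl("gmail", toy_names, ignore.case = TRUE)

# Keep names where has_at is TRUE and is_gmail is FALSE
non_gmail <- toy_names[has_at & !is_gmail]
print(non_gmail)
```

Splitting the condition into two named logical vectors makes it easier to inspect each half before combining them with & and !.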
Some good insight, isn’t it? Someone from HP Enterprise, IIT-BHU and IIM-A.
Let’s try some more string operations. Usernames are unique, so we can’t really look for repeated names. But can we find which letter of the alphabet is the most common starting letter?
Let’s create a new variable ‘f_letter’ for this purpose. It’s just a substring operation.
mind_lb$f_letter <- tolower(substr(mind_lb$Name, 1, 1))
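Before plotting anything, the first letters can also be counted directly. A minimal sketch on made-up names (toy_names and letter_counts are illustrative assumptions, not part of the original code):

```r
# Made-up usernames (not from the leaderboard)
toy_names <- c("alice", "Sam", "sara", "bob", "Steve", "anna")

# First character of each name, lower-cased so "S" and "s" count together
f_letter <- tolower(substr(toy_names, 1, 1))

# Frequency of each starting letter, most common first
letter_counts <- sort(table(f_letter), decreasing = TRUE)
print(letter_counts)
```

In this toy sample "s" leads with three names, followed by "a" with two.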
Data Visualization:
Let’s make a simple graph out of it to see which letter wins.
library(ggplot2)
qplot(data = mind_lb, f_letter)
It seems ‘s’ is the winner and ‘a’ is the runner-up. Well, that’s just for fun. But can we make a better-looking visualization? Let’s draw the hackathon score distribution.
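One way a score distribution could be drawn is a ggplot2 histogram. This is only a sketch and not the code from my GitHub: the scores are simulated with rnorm(), and the bin count, colours and labels are arbitrary assumptions. With the real data you would pass mind_lb instead of the scores data frame.

```r
library(ggplot2)

# Simulated scores (made up; stand-in for mind_lb$Score)
set.seed(42)
scores <- data.frame(Score = rnorm(200, mean = 0.8, sd = 0.05))

# Histogram of the Score column
p <- ggplot(scores, aes(x = Score)) +
  geom_histogram(bins = 30, fill = "steelblue", colour = "white") +
  labs(title = "Hackathon score distribution",
       x = "Score",
       y = "Number of participants")

print(p)
```

Changing bins is the quickest way to see whether the shape of the distribution is real or just an effect of the bin width.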
The complete code can be found here on my GitHub.
Final Notes:
Doesn’t it seem easy to find some valuable insights in a dataset? Data Analytics is in fact easier than it looks. All you need is an open mind to see through the data; the tool and syntax you select will come in handy once you start.
This is not a tutorial post, just a glimpse of how easy R and Data Analytics can be.
Are you ready to dive into the world of Data Analytics? If so, download R and RStudio and start today. Also create a GitHub account, share your code and visualizations, and post the link in the comments here.