People often ask me “How can I learn Data Analytics?” and I often stumble upon the question “How to become a Data Analyst” on Quora too. The answer is readily available all over the internet. The actual issue is not how to become a data analyst but whether we are ready to become one.

This post aims to take a newbie into the world of Data Analytics with simple, freely available public data and R (the open-source champ of Data Science).

Data collection: The data used here is the public leaderboard of a recent Analytics Vidhya Hackathon. Just copy the table data, paste it into MS Excel and save it as ‘av-hackathon.csv’; your data for analysis is ready.

Reading Input Data: Let’s read the csv into R Studio using read.csv.

mind_lb <- read.csv('av-hackathon.csv', header = TRUE, stringsAsFactors = FALSE)

Basic Summary: Understanding the dataset is the first thing any analyst should do. We can use str() or summary() to explore the basic summary of our dataset and to see sample values.

We can clearly see that there are 1037 observations (rows/entries) and 3 variables/columns, along with their data types: two are numeric (one is just a serial number and the other is Score) and Name is of character type.

Data Analysis: When we scroll through our dataset (the input file we just read), we can see some users explicitly using their email id as their user name. Can we see how many such users have ‘@’ in their user name? Let’s use grepl (regular expression) to match the names containing the @ symbol.

Using grepl('@', mind_lb$Name) would return just TRUE/FALSE against each observation, but what we actually need is a count, so let’s use the table() function in R to find it out.

table(grepl('@', mind_lb$Name)) returns the actual count (absolute figures) of usernames with @ and without @. But wouldn’t it be better to represent this in terms of percentage?
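Before moving on to percentages, here is a minimal sketch of the steps so far, run on a tiny made-up data frame instead of the real leaderboard so it works without the CSV. The usernames and scores are invented for illustration, and the column name Rank is an assumption (the post only says the first column is a serial number).

```r
# Hypothetical stand-in for the leaderboard read by read.csv().
# Rows are made up; Rank is an assumed column name.
mind_lb <- data.frame(
  Rank  = 1:5,
  Name  = c("sam", "Sara@gmail.com", "alice", "bob@hpe.com", "carol"),
  Score = c(0.91, 0.88, 0.87, 0.85, 0.80),
  stringsAsFactors = FALSE
)

str(mind_lb)      # column types plus a few sample values
summary(mind_lb)  # per-column summary statistics

# One TRUE/FALSE per row: does the username contain "@"?
has_at <- grepl("@", mind_lb$Name, fixed = TRUE)

# Collapse the logical vector into counts of FALSE and TRUE
at_counts <- table(has_at)
print(at_counts)
```

On the real data, only the data.frame() call would be swapped for the read.csv() line; the rest stays the same.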
Wrapping prop.table() around table(grepl('@', mind_lb$Name)) gives us the values in decimals, which multiplied by 100 give the actual percentage of usernames with @ and without @.

Now we know that almost 6.5% of usernames contain @. Obviously gmail must be contributing most of it, but can we find out if there’s something else apart from gmail?

Let’s perform a logical operation, but first rewrite it in words to understand the logic. We want all usernames with @, not gmail:

@ AND NOT gmail

Voila! Here’s the code for it:

mind_lb$Name[grepl('@', mind_lb$Name) & !grepl('gmail', mind_lb$Name)]

which identifies usernames with @hpe.com, @iitbhu and @iimahd.ernet.in. Some good insight, isn’t it? Someone from HP Enterprise, IIT-BHU and IIM-A.

Let’s try some more string operations. Usernames are unique, so we can’t really look for repeated names. But can we find which letter is the most common starting letter? Let’s create a new variable ‘f_letter’ for this purpose. It’s just a substring operation.

mind_lb$f_letter <- tolower(substr(mind_lb$Name, 1, 1))

Data Visualization: Let’s make a simple graph out of it to see which letter wins.

library(ggplot2)
qplot(f_letter, data = mind_lb)

It seems ‘s’ is the winner and ‘a’ is the runner-up. Well, that’s for fun. But can we make a better-looking visualization? Let’s draw the hackathon score distribution.

The complete code can be found here on my github.

Final Notes: Doesn’t it seem easy to find some valuable insights in a dataset? Data Analytics is in fact easy. All you need is an open mind to see through the data; the tool and syntax you select will come in handy once you start. This is not a tutorial post but just a glimpse of the easiness of R and Data Analytics. Are you ready to dive into the world of Data Analytics? If so, download R and R Studio and start today. Also create a github account, share your code and visualization, and comment the link here.
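For anyone who wants to try the later steps without the leaderboard file, here is a rough sketch of the percentage, the "@ AND NOT gmail" filter, the first-letter count and the score distribution on a small made-up data frame. The rows are invented, Rank is an assumed column name, and the histogram uses ggplot() with geom_histogram() as one possible way to draw the distribution, not necessarily the code behind the original chart.

```r
# Hypothetical stand-in for the leaderboard; all rows are made up.
mind_lb <- data.frame(
  Rank  = 1:5,
  Name  = c("sam", "Sara@gmail.com", "alice", "bob@hpe.com", "carol"),
  Score = c(0.91, 0.88, 0.87, 0.85, 0.80),
  stringsAsFactors = FALSE
)

# Share of usernames with and without "@", in percent
has_at <- grepl("@", mind_lb$Name, fixed = TRUE)
at_pct <- prop.table(table(has_at)) * 100

# "@ AND NOT gmail": email-like usernames from other domains
non_gmail <- mind_lb$Name[has_at & !grepl("gmail", mind_lb$Name, fixed = TRUE)]

# Lower-cased first letter of each username, then counts per letter
mind_lb$f_letter <- tolower(substr(mind_lb$Name, 1, 1))
letter_counts <- sort(table(mind_lb$f_letter), decreasing = TRUE)

print(at_pct)
print(non_gmail)
print(letter_counts)

# Score distribution; drawn only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(mind_lb, ggplot2::aes(x = Score)) +
    ggplot2::geom_histogram(bins = 10) +
    ggplot2::labs(x = "Score", y = "Number of participants")
  print(p)
}
```

With the real leaderboard, replacing the data.frame() call with the read.csv() line from the post should be the only change needed; the bin count of 10 is an arbitrary choice worth tuning.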