Exploratory data analysis plays an important role in data science and machine learning work. In this post I am giving a brief intro to exploratory data analysis (EDA) in Python with the help of pandas and matplotlib. Let’s start now.

What is Exploratory data analysis?

According to Wikipedia:

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

You can say that EDA is the statistician’s way of storytelling: you explore the data, find patterns, and tell the insights. Often you have some questions in hand and you try to validate them by performing EDA.

Exploratory data analysis (EDA) with Python

Multiple libraries are available to perform basic EDA, but I am going to use pandas and matplotlib for this post: pandas for data manipulation and matplotlib, well, for plotting graphs. Jupyter Notebooks will be used to write the code and other findings. A Jupyter notebook is a kind of diary for data analysts and scientists, a web-based platform where you can mix Python, HTML and Markdown to explain your data insights.

OK, enough talk, it’s time to explore the data and find out about it. The data I am using belongs to a fictitious super store. We are going to find answers to a few questions, like which days are best in terms of sales and which hours are busiest, so that staff can be scheduled accordingly.

If you don’t have Jupyter installed, install it by running the command pip install jupyter. If you are using the Anaconda distribution of Python, then Jupyter and pandas will already be there. I highly recommend Anaconda, as it is specially packaged for people who are into data analysis or data science.

Launch the Jupyter notebook by running the command jupyter-notebook.
Running this command in the Anaconda prompt or your regular command prompt will open an interface in your default browser. Create a new notebook and name it. You can add text in Markdown format in a cell, and in the next cell you can start writing code. Jupyter notebooks have the *.ipynb extension. This is not a Jupyter 101 post (you can find many of those on the Internet), but I will try to explain as much as I can.

In the first cell I just added a heading about what I am going to do. You can add anything here, even your selfie, but it is good to put up things that help the notebook reader understand what you are talking about.

OK, first we need to acquire the data. The data is available in CSV format. After the necessary imports I am going to call the read_csv method of pandas to read the CSV file.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

If you notice, I added %matplotlib inline. What it does is render plots inline on the page; you will shortly see how it happens. Next up, read the CSV and clean the data where necessary.

```python
df = pd.read_csv('data.csv')
# strip the currency symbol and thousands separators, e.g. "$1,234.50" -> "1234.50"
df['Amount'] = df['Amount'].str.replace('$', '', regex=False).str.replace(',', '', regex=False)
df['Amount'] = pd.to_numeric(df['Amount'])
```

First I removed the $ sign and the commas, then converted the string field into a numeric one. Once done, we have the data as float, since we are going to perform mathematical operations on this field. Run the cell by hitting Shift+Enter.

One thing more: the BranchName field seems unnecessary, since we only have data for a single store, so let’s remove it!

```python
df.drop('BranchName', axis=1, inplace=True)
df
```

It will remove the column; inplace=True makes it drop the column in the existing DataFrame without re-assigning it. Run it again and the data shows up without that column.

OK, the cleanup operation is done, let’s dive into the data and find insights!

The very first thing we are going to do is find out the number of records and the number of features or columns. For that I am going to execute df.shape.
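As a quick, self-contained illustration of the cleaning step above, here is a minimal sketch. The column name Amount matches the post, but the sample values are invented, since the store’s CSV is not included here:

```python
import pandas as pd

# A tiny stand-in for the store data; the values are invented for illustration.
df = pd.DataFrame({'Amount': ['$1,250.00', '$75.50', '$3,000.25']})

# Strip the currency symbol and thousands separators, then convert to numbers.
df['Amount'] = (df['Amount']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False))
df['Amount'] = pd.to_numeric(df['Amount'])

print(df['Amount'].dtype)  # float64
print(df['Amount'].sum())  # 4325.75
```

Passing regex=False makes the intent explicit: in older pandas versions str.replace treated the pattern as a regular expression by default, and '$' is a regex anchor rather than a literal dollar sign.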
```python
df.shape
```

What does it mean? Well, it is actually rows x columns. So there are 4100 records and 9 columns in total.

Alright, I got the idea of the total records and columns, but I need a detailed summary of this data. For that I am going to run:

```python
df.describe()
```

Some interesting information is given here. The count row shows the same record count, 4100. All columns have the same count, which means there are no missing fields. You can also check the count of an individual column, say Units.

You are getting a picture of how the data looks: the mean, min and max along with the standard deviation and the percentiles (the 50th percentile being the median). Standard deviation is quite a useful tool to figure out how the data is spread above or below the mean: the higher the value, the less representative the mean is. For instance, the std of Amount is 183.5 while its mean is around 35. On the other hand, the mean of Units is 12.7 and its std is 17.85. Oh, just to clarify, std is short for Standard Deviation, NOT Sexually Transmitted Disease, lest you think our data caught some disease.

Let’s see the distribution of Amount. Ignore the giant spike for a while and notice the baseline, which is very wide, varying from -1000 to 1000+.

Let’s find out sales by month, day and hour.

Sale by Month

You can use .size() to get the aggregated value for the grouping only, or .count() to get it for every column. Since we only need it for Month, I used .size(). Plot the graph and you find this: life was beautiful till July, but then something happened and there was a sharp decline in August; the staff tried hard for the next 3 months and then things died again.
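The month-wise aggregation described above can be sketched as follows. The column names Amount and Month come from the post, but the sample frame and its values are invented stand-ins for the 4100-row CSV:

```python
import pandas as pd

# Invented sample rows for illustration only.
df = pd.DataFrame({
    'Month': ['Jan', 'Jan', 'Feb', 'Mar', 'Mar', 'Mar'],
    'Amount': [120.0, 80.0, 200.0, 50.0, 75.0, 25.0],
})

# .size() counts rows per group; .count() would report it column by column.
sales_by_month = df.groupby('Month').size()
print(sales_by_month)

# In a notebook you could then visualize it, and plot the Amount
# distribution mentioned earlier, with something like:
# sales_by_month.plot(kind='bar')
# df['Amount'].plot(kind='hist')
```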
Let’s see sales by day.

Sale by Day

Sales dropped massively at the end of the month; otherwise there were constant hiccups, and the 18th day is quite good.

Sale by Hour

It seems more customers visit in the afternoon than around the opening and closing hours.

Conclusion

That is as far as I will go for now. You can dig more insights out of this data set with exploratory data analysis than I have shown above. In this data there is a field Transaction Type; your task is to find out the number of sales of each transaction type. Let us know how it goes.

As always, the code for this post is available on Github. You can tell me what you think about this, and if you enjoyed reading, click on the clap 👏 button.

Thanks, everyone.
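For readers attempting the closing exercise, here is a minimal hedged sketch of the same groupby pattern applied to day, hour and transaction type. The column names Day, Hour and Transaction Type come from the post; the sample data is invented:

```python
import pandas as pd

# Invented sample rows for illustration only.
df = pd.DataFrame({
    'Day': [1, 1, 18, 18, 18, 30],
    'Hour': [9, 13, 13, 14, 14, 20],
    'Transaction Type': ['Cash', 'Card', 'Cash', 'Card', 'Card', 'Cash'],
})

# Sales per day and per hour, the same pattern as the month-wise aggregation.
sales_by_day = df.groupby('Day').size()
sales_by_hour = df.groupby('Hour').size()

# The exercise: number of sales of each transaction type.
by_type = df['Transaction Type'].value_counts()
print(by_type)
```

value_counts is a convenient shortcut here; df.groupby('Transaction Type').size() would give the same numbers.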