Exploratory data analysis plays a central role in data science and machine learning.
------------------------------------------------------------------------------------

In this post I give a brief introduction to exploratory data analysis (EDA) in Python with the help of `pandas` and `matplotlib`.

Let's start.

### What is exploratory data analysis?

According to [**Wikipedia**](https://en.wikipedia.org/wiki/Exploratory_data_analysis):

> In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

You could say that EDA is the statistician's way of _storytelling_: you explore data, find patterns, and share insights. Often you have some questions in hand, and you validate them by performing EDA.

![Analysis](https://hackernoon.com/hn-images/1*rjlQ_fHUS2VdIHMK0EMZRQ.jpeg)

![Analysis](https://hackernoon.com/hn-images/1*yXgmrwm04eibYXM5P5-kBw.jpeg)

Photo on [Unsplash](https://unsplash.com/search/photos/analysis?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText)

### Exploratory data analysis (EDA) with Python

Multiple libraries are available for basic EDA, but I am going to use [**pandas**](https://pandas.pydata.org/) and matplotlib for this post: pandas for data manipulation and matplotlib, well, for plotting graphs. I will use [**Jupyter Notebooks**](http://jupyter.org/) to write code and record other findings. A Jupyter notebook is a kind of diary for data analysts and scientists: a web-based platform where you can mix Python, HTML, and Markdown to explain your data insights.

OK, enough talk; it's time to explore the data and find out about it. The data I am using belongs to a fictitious super store.
We are going to answer a few questions, such as which days are best in terms of sales and which hours are busiest, so that staff can be scheduled accordingly.

If you don't have Jupyter installed, install it by running the command `pip install jupyter`.

If you are using the [**Anaconda**](https://anaconda.org/) distribution of Python, then Jupyter and pandas are already included. I highly recommend it, as it is specially packaged for people who are into data analysis or data science.

Launch Jupyter Notebook by running the command `jupyter-notebook`.

The command output is shown below:

![Launching Jupyter from the command prompt](https://hackernoon.com/hn-images/1*n8uw4m2wOk8Xss8_MMNIwA.png)

![Launching Jupyter from the Anaconda prompt](https://hackernoon.com/hn-images/1*NG2DI3Sff7KmpPv6YJUSHg.png)

Anaconda prompt & command prompt

It will open an interface in your default browser. Create a new notebook and name it; you will see something like this:

![A new Jupyter notebook](https://hackernoon.com/hn-images/1*zEhQS5ZG4ti9uBXt7hWNhA.png)

You can add text in Markdown format in one cell and start writing code in the next. Jupyter notebooks have the `*.ipynb` extension. This is not a Jupyter 101 post (you can find many of those on the Internet), but I will try to explain as much as I can along the way.

In the first cell I just added a heading about what I am going to do. You can add anything here, even your [**selfie**](https://2static.fjcdn.com/pictures/Horrible+selfie_a9f326_5690097.jpg), but it's good to put up things that help the reader understand what you are talking about.

OK, first we need to acquire the data, which is available in CSV format. After the necessary imports, I call the `read_csv` method of `pandas` to read the CSV file.

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
```

Notice that I added `%matplotlib inline`. This magic command renders plots inline on your page; you will shortly see how that looks.
Next up: read the CSV and clean the data where necessary.

```python
df = pd.read_csv('data.csv')
# regex=False makes the replacements literal; with the default regex
# semantics, '$' would be treated as an end-of-string anchor.
df['Amount'] = (df['Amount']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False))
df['Amount'] = pd.to_numeric(df['Amount'])
```

First I removed the `$` sign and the thousands separator, then converted the string field to numeric. Once done, we have the data as `float`, which we need since we are going to perform mathematical operations on this field. Run the cell by hitting `Shift+Enter` and something like this appears:

![DataFrame after cleaning the Amount column](https://hackernoon.com/hn-images/1*21z8TBM5dsJ72SMGAdsFDw.png)

One more thing: the `BranchName` field is unnecessary, since we only have data for a single store, so let's remove it!

```python
df.drop('BranchName', axis=1, inplace=True)
df
```

This removes the column; `inplace=True` makes the change in the existing `DataFrame` without re-assigning it. Run the cell again and it shows the data like this:

![DataFrame after dropping BranchName](https://hackernoon.com/hn-images/1*oZxg7YmLC5s3Z9JnhZmktA.png)

OK, _operation cleanup_ is done. Let's dive into the data and find insights!

The very first thing we are going to do is find the number of records and the number of features, or columns. For that I execute `df.shape`, which gives the following:

![Output of df.shape](https://hackernoon.com/hn-images/1*JfcY06s9FLaY7RdqvP1zNQ.png)

What does it mean? It's **rows x columns**: there are **4100** total records and **9** columns, as you can also verify by counting the columns above.

Alright, I now have an idea of the total records and columns, but I need a more detailed summary of the data. For that I run `df.describe()`, which outputs:

![Output of df.describe()](https://hackernoon.com/hn-images/1*dee76UlotOIIHUlu52yMzA.png)

Some interesting information here. The `count` row confirms the same record count, `4100`. All columns have the same count, which means there are no missing values.
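To make the cleanup and summary steps concrete, here is a minimal, self-contained sketch on a few made-up rows (the column names mirror the store data, but the values are invented):

```python
import pandas as pd

# A few invented rows shaped like the store data
df = pd.DataFrame({
    'Amount': ['$1,200.50', '$35.00', '$980.25'],
    'Units': [3, 1, 2],
})

# Strip the currency symbol and thousands separator, then convert to float.
# regex=False makes the replacements literal rather than regex patterns.
df['Amount'] = (df['Amount']
                .str.replace('$', '', regex=False)
                .str.replace(',', '', regex=False))
df['Amount'] = pd.to_numeric(df['Amount'])

print(df.shape)            # rows x columns
print(df['Amount'].dtype)  # now a float column
print(df.describe())       # count, mean, std, min, percentiles, max
```

On the real store data the same chain of calls produces the `(4100, 9)` shape and the summary table shown in the screenshots.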
You can also check an individual column's count, say for `Units`; the output looks like the picture below:

![Count of the Units column](https://hackernoon.com/hn-images/1*cCORDjrWudRTDSDgHc1e6Q.png)

You are now getting a picture of the data: the mean, min, and max, along with the standard deviation, and the percentiles are there too (the 50th percentile is the median). Standard deviation is a quite useful tool for figuring out how the data is spread above and below the mean: the higher the value, the more widely the data varies around the mean. For instance, the `std` of `Amount` is `183.5` while its mean is around `35`; on the other hand, the mean of `Units` is `12.7` and its `std` is `17.85`. Oh, and just to clarify: `std` here is short for standard deviation, NOT _sexually transmitted disease_; I thought I'd clarify, lest you think our data caught some disease.

Let's see the distribution of `Amount`:

![Code for plotting the Amount distribution](https://hackernoon.com/hn-images/1*6mkAY8oAYfvmYUJDz-FJSg.png)

And it outputs:

![Histogram of Amount](https://hackernoon.com/hn-images/1*GC4rmdqxLcHmXnAQdGD12w.png)

OK, ignore this giant spike for a while and notice the base line, which is very wide: it varies from `-1000` to over `1000`.

Let's find out sales by month, day, and hour.

### Sale by Month

![Code for grouping sales by month](https://hackernoon.com/hn-images/1*h7h3MEeOebyibRC6UHw_pA.png)

You can use `.size()` to get the aggregated value per group only, or `.count()` to count every column. Since we only need the count per `Month`, I used `.size()`.
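The screenshot above hides the actual code, so here is a rough sketch of what that month-level aggregation might look like, assuming the data has a parseable date column (the `Date` column name and the values here are my invention, not from the post):

```python
import pandas as pd

# Invented transactions spread over three months
df = pd.DataFrame({
    'Date': pd.to_datetime(['2017-01-05', '2017-01-20',
                            '2017-02-11', '2017-03-02']),
    'Amount': [120.0, 80.0, 200.0, 50.0],
})

# Group by calendar month and count rows per group.
# .size() gives one number per group; .count() would count every column.
sales_by_month = df.groupby(df['Date'].dt.month).size()
print(sales_by_month)

# A bar chart of it would follow with:
# sales_by_month.plot(kind='bar')
```

The same pattern with `.dt.day` or `.dt.hour` drives the day-level and hour-level breakdowns in the next sections.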
Plot the graph and you find this:

![Sales by month](https://hackernoon.com/hn-images/1*iQ-4FHM5AZZYyl878uXIuw.png)

Life was beautiful until July, but then something happened and there was a sharp decline in August. The staff tried hard for the next three months, and then things died down again.

Let's see it by day.

### Sale by Day

![Code for grouping sales by day](https://hackernoon.com/hn-images/1*T5kFx8Aq9p-3bokAxNS8fQ.png)

The output is:

![Sales by day of month](https://hackernoon.com/hn-images/1*N7BjBbh88-BLzg0IaHGc7w.png)

Sales dropped massively at the end of the month; otherwise there were constant hiccups, and the 18th was quite a good day.

### Sale by Hour

![Code for grouping sales by hour](https://hackernoon.com/hn-images/1*RPhKbWioSt0xWRD-tzU_Mg.png)

And it outputs:

![Sales by hour](https://hackernoon.com/hn-images/1*Zs3CgC3AncNGNnTotYXFVg.png)

OK, it seems more customers visit in the afternoon than around opening and closing hours.

### Conclusion

This is as far as I will go here. There are many more insights to find in this data set; what I have shown above is just a taste of how exploratory data analysis can extract them. The data also contains a `Transaction Type` field; your task is to find the number of sales for each transaction type. Let us know how it goes.

As always, the code for this post is available on [**GitHub**](https://github.com/harunshimanto/Super-Store-s-Exploratory-data-analysis-EDA-in-Python).

You can [tell me](https://www.linkedin.com/in/harun-ur-rashid6647/) what you think about this, and if you enjoyed it, click the clap 👏 button.

Thanks to everyone.
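As a small nudge for the exercise above, counting sales per transaction type could look like this sketch (the values are invented; only the `Transaction Type` column name comes from the post):

```python
import pandas as pd

# Invented rows; only the 'Transaction Type' column name is from the post
df = pd.DataFrame({
    'Transaction Type': ['Cash', 'Credit', 'Cash', 'Cash', 'Credit'],
})

# value_counts() tallies how many rows fall under each transaction type
counts = df['Transaction Type'].value_counts()
print(counts)
```

`counts.plot(kind='bar')` would then chart it in the same style as the month, day, and hour plots.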