If you are getting started with machine learning, just like me, you have probably come across Pandas, the data analysis library. In the rush to understand the gimmicks of ML, we often fail to notice the importance of this library. But soon you will hit a roadblock where you need to play with your data, cleaning it and performing transformations before feeding it into your ML model.
Why do we need this blog when there is already plenty of documentation and plenty of tutorials? Pandas, unlike most Python libraries, has a steep learning curve. The reason is that you need to understand your data well in order to apply its functions appropriately. Learning Pandas syntactically is not going to get you anywhere. Another problem with Pandas is that there is often more than one way to do the same thing. Also, when I started with Pandas, its extensive and elaborate documentation was overwhelming, and the cheatsheets scared me even more.
In this blog, I am going to take you through Pandas functionality by solving specific use cases that you would need to handle with a given dataset.
Before we move on to the code for understanding the features of Pandas, let's get Pandas installed on your system. I advise you to create a virtual environment and install Pandas inside it.
virtualenv -p python3 venv
source venv/bin/activate
pip install pandas
If you are learning Pandas, I would advise you to dive in and use a Jupyter notebook. The visualization of data in Jupyter notebooks makes it easier to understand what is going on at each step.
pip install jupyter
jupyter notebook
Jupyter by default runs in your system-wide installation of Python. In order to run it inside your virtualenv, follow this guide and create a user-level kernel: https://anbasile.github.io/programming/2017/06/25/jupyter-venv/
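If you'd rather just run the commands, here is a minimal sketch of what that guide boils down to, assuming your virtualenv is named venv and is currently activated:

# Register the active virtualenv as a user-level Jupyter kernel
pip install ipykernel
python -m ipykernel install --user --name=venv

Once registered, you can select the venv kernel when creating a new notebook.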
I created a simple purchase order dataset. It comprises the sales data of each salesperson of a company across countries and their branches in different regions of each country. The full dataset is reproduced below so you can recreate the spreadsheet yourself:

| Country | Region | Sales Person | Date of Purchase | Total | Quantity |
|---|---|---|---|---|---|
| India | North | John | 8/1/2011 | 100000 | 567 |
| US | North | Bill | 12/26/2011 | 120000 | 3000 |
| UK | North | Thomas | 8/6/2015 | 140000 | 345 |
| Australia | East | John | 1/8/2010 | 160000 | 1000 |
| Africa | East | Bill | 7/15/2010 | 180000 | 123 |
| Singapore | East | Thomas | 3/29/2012 | 200000 | 1000 |
| Mylasia | West | John | 12/30/2013 | 1000000 | 7890 |
| India | West | Bill | 4/8/2016 | 240000 | 200 |
| US | West | Thomas | 9/4/2018 | 26000000 | 1000 |
| UK | North | John | 5/19/2015 | 100000 | 1000 |
| Australia | North | Bill | 4/15/2012 | 120000 | 567 |
| Africa | North | Thomas | 8/11/2015 | 140000 | 1000 |
| Singapore | East | John | 5/9/2017 | 160000 | 892 |
| Mylasia | East | Bill | 12/13/2013 | 180000 | 444 |
| India | East | Thomas | 3/3/2013 | 200000 | 90 |
| US | West | John | 8/26/2015 | 220000 | 90 |
| UK | West | Bill | 4/13/2012 | 240000 | 90 |
| Australia | West | Thomas | 6/28/2013 | 260000 | 90 |
| Africa | North | John | 10/5/2012 | 140000 | 85 |
| Singapore | North | Bill | 1/7/2016 | 150000 | 85 |
With Pandas, we can load data from different sources: a local CSV file, a remote URL, or a database, to name a few. The loaded data is stored in a Pandas data structure called a DataFrame. DataFrames are usually referred to by the variable name df, so anytime you see df from here on, you should associate it with a DataFrame.
import pandas as pd
df = pd.read_csv("path_to_csv")
You can pass a remote URL to the CSV file in read_csv.
import pandas as pd
df = pd.read_csv("remote/url/path/pointing/to/csv")
In order to read from a database, fetch the rows from the DB with a cursor and use DataFrame() to build the frame:
import pandas as pd

db = ...  # create a DB connection object here
cur = db.cursor()
cur.execute("SELECT * FROM <TABLE>")
# fetchall() returns plain tuples, so pass the column names explicitly
df = pd.DataFrame(cur.fetchall(),
                  columns=[desc[0] for desc in cur.description])
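Alternatively, pandas ships a helper that runs the query and builds the DataFrame for you, column names included. A minimal sketch, assuming a SQLite database file sales.db containing a sales table (both names are hypothetical):

import sqlite3
import pandas as pd

conn = sqlite3.connect("sales.db")  # hypothetical database file
df = pd.read_sql_query("SELECT * FROM sales", conn)  # hypothetical table name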
Each of the above snippets reads data from a source and loads it into Pandas' internal data structure, the DataFrame.
Now that we have the DataFrame ready, let's go through it and understand what's inside it.
# 1. Shows the first five rows of the data
df.head()
# 2. Some statistical information about your data
df.describe()
# 3. List of column headers
df.columns.values
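A few more inspection helpers are worth knowing at this stage:

# Number of rows and columns as a (rows, columns) tuple
df.shape
# Data type of each column
df.dtypes
# Concise summary: column names, non-null counts, dtypes, memory usage
df.info()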
Now that we have loaded our data into a DataFrame and understood its structure, let's pick and choose the data we want to work with. When it comes to selecting your data, you can do it either with indexes or based on certain conditions. In this section, let's go through each of these methods.
Indexes are labels used to refer to your data. These labels are usually your column headers, e.g., Country, Region, Quantity, etc.
# 1. Create a list of columns to be selected
columns_to_be_selected = ["Total", "Quantity", "Country"]
# 2. Use it as an index to the DataFrame
df[columns_to_be_selected]
# 3. Using the loc method; the `:` selects all rows
df.loc[:, columns_to_be_selected]
Unlike the columns, our current DataFrame does not have a label which we can use to refer to the row data. But like arrays, a DataFrame provides numerical indexing (0, 1, 2, …) by default.
# 1. using numerical indexes - iloc
df.iloc[0:3, :]
# 2. using labels as index - loc
row_index_to_select = [0, 1, 4, 5]
df.loc[row_index_to_select]
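You can also combine row and column selection in a single call. One subtlety worth a sketch (assuming the column order of our sample dataset): loc slices are label-based and include the end label, while iloc slices are position-based and exclude it.

# Rows labelled 0 through 3 (end label included), only Country and Total
df.loc[0:3, ["Country", "Total"]]
# Rows at positions 0, 1 and 2 (end excluded), columns at positions 0 and 4
df.iloc[0:3, [0, 4]]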
Now, in a real-world scenario, you would most probably not want to select rows based on an index. An actual real-life requirement would be to filter out the rows that satisfy a certain condition. With respect to our dataset, we can filter by any of the following conditions:
1. Total sales > 200000
df[df["Total"] > 200000]
2. Total sales > 200000 and in UK
df[(df["Total"] > 200000) & (df["Country"] == "UK")]
Most of the time when dealing with date fields, we don't use them as-is. Pandas makes it really easy to extract the date, month, or year from them and perform operations on top of that.
In our sample dataset, the Date of Purchase column is of type string, hence the first step would be to convert it to the datetime type.
>>> type(df['Date of Purchase'].iloc[0])
str
>>> df['Date of Purchase'] = pd.to_datetime(df['Date of Purchase'])
>>> type(df['Date of Purchase'].iloc[0])
pandas._libs.tslibs.timestamps.Timestamp
Extracting the Date, Day, Month & Year
df['Date of Purchase'].dt.date   # e.g. 2018-09-04
df['Date of Purchase'].dt.day    # 4
df['Date of Purchase'].dt.month  # 9
df['Date of Purchase'].dt.year   # 2018
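Once converted, the extracted parts can feed straight into other operations. For example, a quick sketch of yearly sales totals, using the groupby operation covered in the next section:

# Group on the extracted year and sum up the Total column
df.groupby(df['Date of Purchase'].dt.year)['Total'].sum()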
You can perform statistical operations such as min, max, mean, etc. over one or more columns of a DataFrame.
df["Total"].sum()
df[["Total", "Quantity"]].mean()
df[["Total", "Quantity"]].min()
df[["Total", "Quantity"]].max()
df[["Total", "Quantity"]].median()
df[["Total", "Quantity"]].mode()
Now, in a real-world application, the raw use of these statistical functions is rare. More often, you will want to group data based on specific parameters and derive a summary of it.
Let's look at some examples: country-wise sales, and country & region-wise sales.
# 1. Country wise sales and Quantity
df.groupby("Country").sum()
# 2. Quantity of sales over each country & Region
df.groupby(["Country", "Region"])["Quantity"].sum()
# 3. More than one aggregation
df.groupby(["Country", "Region"]).agg(
    {'Total': ['sum', 'max'],
     'Quantity': 'mean'})
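If you are on pandas 0.25 or newer, named aggregation makes the output columns easier to read; the output names below (total_sales and friends) are my own picks:

# Each keyword becomes a column name in the result
df.groupby(["Country", "Region"]).agg(
    total_sales=("Total", "sum"),
    max_sale=("Total", "max"),
    avg_quantity=("Quantity", "mean"))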
A pivot table is an advanced version of groupby: you can stack dimensions over both rows and columns. As the data grows, the groupby output above grows in length and becomes hard to derive insights from; a pivot table gives you a better-organized view of the same data.
import numpy as np
df.pivot_table(index=["Country"],
columns=["Region"],
values=["Quantity"],
aggfunc=[np.sum])
Another advantage of the pivot table is that you can add as many dimensions and functions as you want. It also calculates a grand total value for you.
import numpy as np
df.pivot_table(index=["Country"],
columns=["Region","Requester"],
values=["Quantity"],
aggfunc=[np.sum],
margins=True,
margins_name="Grand Total")
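One practical tip: Country/Region combinations that never occur show up as NaN in the pivot. The fill_value parameter substitutes a default instead; here is a sketch using 0:

import numpy as np
df.pivot_table(index=["Country"],
               columns=["Region"],
               values=["Quantity"],
               aggfunc=[np.sum],
               fill_value=0)  # empty combinations show 0 instead of NaN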
Okay, that was a lot of information in five minutes. Take some time to try out the above exercises. In the next blog, I will walk you through some deeper concepts and the magical visualizations you can create with Pandas.
Every time you start learning Pandas, there is a good chance you may get lost in the jargon: indexes, functions, NumPy, etc. Don't let that get to you. What you really have to understand is that Pandas is a tool to visualize and get a deeper understanding of your data.
With that mindset, take a sample dataset from your spreadsheets and try deriving some insights out of it. Share what you learn. Here is the link to my Jupyter notebook for you to get started.
Did the blog nudge you a bit to give Pandas another chance? Hold the "claps" icon and give a shout-out to me on Twitter. Follow to stay tuned for future blogs.