Python Pandas — Basics to Beyond A tutorial walkthrough of Python Pandas Library For those of you who are getting started with Machine learning, just like me, would have come across Pandas, . In the rush to understand the gimmicks of ML, we often fail to notice the importance of this library. But soon you will hit a roadblock where you would need to play with your data, clean and perform data transformations before feeding it into your ML model. the data analytics library Why do we need this blog when there are already a lot of documentation and tutorials? Pandas, unlike most python libraries, has a steep learning curve. The reason is that you need to understand your data well in order to apply the functions appropriately. Learning Pandas syntactically is not going to get you anywhere. Another problem with Pandas is that there is that there is more than one way to do things. Also, when I started with Pandas it’s extensive and elaborate documentation was overwhelming. I checked out the and that scared me even more. cheatsheets In this blog, I am going to take you through Pandas functionalities by cracking specific use cases that you would need to achieve with a given data. Setup and Installation Before we move on with the code for understanding the features of Pandas, let’s get Pandas installed in your system. I advise you to create a virtual environment and install Pandas inside the virtualenv. Create virtualenv virtualenv -p python3 venv
source venv/bin/activate Install Pandas pip install pandas Jupyter Notebook If you are learning Pandas, I would advise you to dive in and use a jupyter notebook for the same. The visualization of data in jupyter notebooks makes it easier to understand what is going on at each step. pip install jupyter
jupyter notebook Jupyter by default runs in your system-wide installation of python. In order to run it in your virtualenv follow the link and create a user level kernel https://anbasile.github.io/programming/2017/06/25/jupyter-venv/ Sample Data I created a simple purchase order data. It comprises of sales data of each salesperson of a company over countries and their branches at different regions in each country. [Here is a link to the spreadsheet for you to download.](http://Region	Sales Person	Date of Purchase	Total	Quantity																		 2 India	North	John	8/1/2011 0:00:00	100000	567																		 3 US	North	Bill	12/26/2011 0:00:00	120000	3000																		 4 UK	North	Thomas	8/6/2015 0:00:00	140000	345																		 5 Australia	East	John	1/8/2010 0:00:00	160000	1000																		 6 Africa	East	Bill	7/15/2010 0:00:00	180000	123																		 7 Singapore	East	Thomas	3/29/2012 0:00:00	200000	1000																		 8 Mylasia	West	John	12/30/2013 0:00:00	1000000	7890																		 9 India	West	Bill	4/8/2016 0:00:00	240000	200																		 10 US	West	Thomas	9/4/2018 0:00:00	26000000	1000																		 11 UK	North	John	5/19/2015 0:00:00	100000	1000																		 12 Australia	North	Bill	4/15/2012 0:00:00	120000	567																		 13 Africa	North	Thomas	8/11/2015 0:00:00	140000	1000																		 14 Singapore	East	John	5/9/2017 0:00:00	160000	892																		 15 Mylasia	East	Bill	12/13/2013 0:00:00	180000	444																		 16 India	East	Thomas	3/3/2013 0:00:00	200000	90																		 17 US	West	John	8/26/2015 0:00:00	220000	90																		 18 UK	West	Bill	4/13/2012 0:00:00	240000	90																		 19 Australia	West	Thomas	6/28/2013 0:00:00	260000	90																		 20 Africa	North	John	10/5/2012 0:00:00	140000	85																		 21 Singapore	North	Bill	1/7/2016 0:00:00	150000	85																		 22) Load data into Pandas With Pandas, we can load data from different sources. Few of them are loading from CSV or a remote URL or from a database. The loaded data is stored in a Pandas data structure called DataFrame. DataFrame’s are usually refered by the variable name df . So, anytime you see df from here on you should be associating it with Dataframe. From CSV File import pandas
df = pandas.read_csv("path_to_csv") From Remote URL You can pass a remote URL to the CSV file in read_csv. import pandas
df = pandas.read_csv("remote/url/path/pointing/to/csv") From DB In order to read from Database, read the data from DB into a python list and use DataFrame() to create one db = # Create DB connection object 
cur = db.cursor()
cur.execute("SELECT * FROM <TABLE>")
df = pd.DataFrame(cur.fetchall()) Each of the above snippets reads data from a source and loads it into Pandas’ internal data structure called DataFrame Understanding Data Now that we have the Dataframe ready let’s go through it and understand what’s inside it # 1. shows you a gist of the data
df.head()

# 2. Some statistical information about your data
df.describe()

# 3. List of columns headers
df.columns.values Pick & Choose your Data Now that we have loaded our data into a DataFrame and understood its structure, let’s pick and choose and perform visualizations on the data. When it comes to selecting your data, you can do it with both Indexesor based on certain conditions. In this section, let’s go through each one of these methods. Indexes Indexes are labels used to refer to your data. These labels are usually your column headers. For eg., Country, Region, Quantity Etc., Selecting Columns # 1. Create a list of columns to be selected
columns_to_be_selected = ["Total", "Quantity", "Country"]

# 2. Use it as an index to the DataFrame
df[columns_to_be_selected]

# 3. Using loc method
df.loc[columns_to_be_selected] Selecting Rows Unlike the columns, our current DataFrame does not have a label which we can use to refer the row data. But like arrays, DataFrame provides numerical indexing(0, 1, 2…) by default. # 1. using numerical indexes - iloc
df.iloc[0:3, :]

# 2. using labels as index - loc
row_index_to_select = [0, 1, 4, 5]
df.loc[row_index_to_select] Filtering Rows Now, in a real-time scenario, you would most probably not want to select rows based on an index. An actual real-life requirement would be to filter out the rows that satisfy a certain condition. With respect to our dataset, we can filter by any of the following conditions 1. Total sales > 200000
df[df["Total"] > 200000]

2. Total sales > 200000 and in UK
df[(df["Total"] > 200000) & (df["Country"] == "UK")] Playing With Dates Most of the times when dealing with date fields we don’t use them as it is. Pandas make it really easy for you to project Date/Month/Year from it and perform operations on top of it In our sample dataset, the Date_of_purchase is of type string, hence the first step would be to convert them to the DateTime type. >>> type(df['Date of Purchase'].iloc[0])
str Converting Column to DateTime Object >>> df['Date of Purchase'] = pd.to_datetime(df['Date of Purchase'])
>>> type(df['Date of Purchase'].iloc[0])
pandas._libs.tslibs.timestamps.Timestamp Extracting Date, Month & Year df['Date of Purchase'].dt.date    # 11-09-2018
df['Date of Purchase'].dt.day     # 11
df['Date of Purchase'].dt.month   # 09
df['Date of Purchase'].dt.year    # 2018 Grouping Statistical operations You can perform statistical operations such as min, max, mean etc., over one or more columns of a Dataframe. df["Total"].sum()
df[["Total", "Quantity"]].mean()
df[["Total", "Quantity"]].min()
df[["Total", "Quantity"]].max()
df[["Total", "Quantity"]].median()
df[["Total", "Quantity"]].mode() Now in a real-world application, the raw use of these statistical functions are rare, often you might want to group data based on specific parameters and derive a gist of the data. Let’s look at an example where we look at the country-wise, country & Region-wise sales. # 1. Country wise sales and Quantity
df.groupby("Country").sum()

# 2. Quantity of sales over each country & Region
df.groupby(["Country", "Region"])["Quantity"].sum()

# 3. More than one aggregation
df.groupby(["Country", "Region"]).agg(
                     {'Total':['sum', 'max'], 
                      'Quantity':'mean'}) Pivot Table Pivot Table is an advanced version of groupby, where you can stack dimensions over both rows and columns. i.e., as the data grows the groupby above is going to grow in length and will become hard to derive insights, hence a well-defined way to look at it would be Pivot tables import numpy as np
df.pivot_table(index=["Country"], 
               columns=["Region"], 
               values=["Quantity"], 
               aggfunc=[np.sum]) Another advantage of the Pivot Table is that you can add as many dimensions and functions you want. It also calculates a grand total value for you import numpy as np
df.pivot_table(index=["Country"], 
               columns=["Region","Requester"], 
               values=["Quantity"], 
               aggfunc=[np.sum], 
               margins=True,
               margins_name="Grand Total") Okay, that was a lot of information in 5 minutes. Take some time in trying out the above exercises. In the next blog, I will walk you through some more deeper concepts and magical visualizations that you can create with Pandas. Every time you start learning Pandas, there is a good chance that you may get lost in the Pandas jargons like index, functions, numpy etc., But don’t let that get to you. What you really have to understand is that Pandas is a tool to visualize and get a deeper understanding of your data. With that mindset take a sample dataset from your spreadsheet and try deriving some insights out of it. Share what you learn. Here is the link to my jupyter notebook for you to get started. Did the blog nudge a bit to give Pandas another chance? Hold the “claps” icon and give a shout out to me on twitter . Follow to stay tuned on future blogs

Google

Twitter

Python Pandas — Basics to Beyond

About Author

Comments

TOPICS

THIS ARTICLE WAS FEATURED IN

Related Stories

Caches in Python

What Are Convolution Neural Networks? [ELI5]

The Noonification: Have U Been Pwned? (1/12/2023)

Goldman Sachs, Data Lineage, and Harry Potter Spells

People are still crazy about Python after twenty-five years

10 Questions to Consider when Setting up a Corporate A.I project

Caches in Python

What Are Convolution Neural Networks? [ELI5]

The Noonification: Have U Been Pwned? (1/12/2023)

Goldman Sachs, Data Lineage, and Harry Potter Spells

People are still crazy about Python after twenty-five years

10 Questions to Consider when Setting up a Corporate A.I project

Light-Mode

Classic

Newspaper

Minty

Dark-Mode

Neon Noir

Minty

HN StartUps