I figure since you have found yourself navigating to this page that you probably have a good amount of data that you are looking to analyze, and you may possibly be wondering the best and most efficient way to answer some of your questions about your data. The answer to your questions can be found with the use of the python package: Pandas.
Due to the popularity of Pandas it has its own conventional abbreviation, so anytime you are importing pandas into python, use the nomenclature below:
import pandas as pd
The pandas API defines a pandas dataframe as:
Two-dimensional, size-mutable, potentially heterogeneous tabular data. Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.
Basically, all that means is that you have data contained in the format to what you see below. Data found in rows and columns:
Example dataframe with labels for data, rows and columns. Dataset from https://www.kaggle.com/yamaerenay/spotify-dataset-19212020-160k-tracks
DataFrames are extremely useful because they provide a simple method to print a table visualization and then manipulate it the way you want it to be. The rows can be easily referenced by the index which are the numbers on the far left of the dataframe. The index will be the corresponding row’s number starting at zero, unless you specify names to each row. The columns can also be easily referenced by the column name such as “Track Name” or by their position in the dataframe. We will talk in more detail about referencing rows and columns later in this article.
Photo by Rhett Wesley on Unsplash
There are several ways to create a pandas dataframe:
In general, you will mainly be putting data from a .csv file, or some type of data source (i.e. SQL database), into a pandas dataframe. You will not be making it from scratch since that would take an insanely long time to do depending on the amount of data you have. However, in the off chance that you need to, here is a quick, simple example from a python dictionary:
import pandas as pd
dict1 = {'Exercises': ['Running','Walking','Cycling'],
'Mileage': [250, 1000, 550]}
df = pd.DataFrame(dict1)
df
Output:
Basic dataframe made from code above
The dictionary keys (‘Exercises’ and ‘Mileage’) become the corresponding column headers. The values in the dictionary being the lists in this example became the individual data points in the dataframe. The order that the lists are in matters since Running will be placed in the first row since it is first in the ‘Exercises’ list and 250 will be placed in first row in the second column since it is first in the ‘Mileage’ list. Also, you will notice that since I did not specify labels for the index of the dataframe it automatically was labelled 0,1, and 2.
However, like I said before the most likely way that you will be creating a pandas dataframe is from a csv or other type of file that you will import in to analyze the data. This is easily completed with just the following:
df = pd.read_csv("file_location.../file_name.csv")
pd.read_csv() is an extremely strong and versatile method and will be extremely useful depending on how you are looking to import your data. If your csv file already comes with headers or an index, you can specify this while importing and make you life so much easier. In order to understand the full ability of pd.read_csv() I suggest you look at the pandas API here.
Now, you have imported your data into your python editor and you are ready to analyze it. However, before we get to the juicy part of answering your analytical questions you have to become familiar with your data and see what it looks like. As the person who is analyzing this data, you have to become comfortable with the dataset. There are four methods that I love to use in order to get to know my data and which pandas makes super easy.
raw_song.head()
The line above is the line that I have in the picture at the top of the page. It will display the first 5 lines of the dataframe and each of the columns to give you an easy summary of what the data looks like. You can also specify a certain number of rows inside the () of the method to show more rows if your heart so desires.
.head() method on song data from Spotify dataset
.tail() is the same just displaying the last 5 lines.
raw_song.tail()
.tail() method on song data from Spotify dataset
From these two quick methods, I have a general idea or the column names and just what the data looks like just from a small sample of the dataset. These methods are also really useful especially given a dataset such as the Spotify dataset working with over 3 million rows, you can easily display the dataset and get a quick idea, and it won’t take your computer a long time to display the data.
.info() is also useful in that it shows me a succinct list of all of the columns, their datatypes, and whether you have any null datapoints or not.
raw_song.info(verbose=True, null_counts=True)
.info() method on song data from Spotify dataset
If you have completely integer or float columns (i.e. ‘Position’, ‘Streams’), then .describe() can be a useful method to understand more about your dataset as it will show many descriptive statistics about those columns.
raw_song.describe()
.describe() method on song data from Spotify dataset. Notice that only ‘Position’ and ‘Streams’ columns are shown since they were the only two integer columns, the other columns are strings and do not have descriptive stats.
Lastly, .sample() will allow you to randomly sample your dataframe and see if any manipulation that you made has incorrectly changed something in your dataset, and can also be great when first exploring your dataset just to get an idea of what the dataset contains exactly that was not already shown in the previous methods.
raw_song.sample(10)
.sample() method on song data from Spotify dataset.
I use each of these methods consistently while exploring and preparing my datasets for analysis. Anytime I change the data in a column, change a column name, or add/delete a row/column I will then make sure it all changed the way I wanted it to by quickly running at least some of the previous 5 methods.
Awesome, now you know how to look at your dataset as a whole, but you really just want to look at a few columns or a few rows and get the rest out of your way.
.loc[] and .iloc[]
These two methods will do exactly that just in different ways depending on the way in which you are able to reference a certain row or column.
If you know the label of a row or a column, use .loc[].
If you know the index of a row or a column, use .iloc[].
If you know both, just pick your favorite, flip a coin, or use the one that you know will not potentially change. For example, if you add rows or columns to your dataframe the index will change on these rows/columns and may cause you to reference the wrong row/column later.
Thus, going back to the Spotify dataset. You can just look at the column “Track Name” by using either .loc[] or .iloc[]. With .loc[] since I know the label of the column I would use the following:
raw_song.loc[:,'Track Name']
The colon after the first bracket specifies which rows I am referencing, and since I want all rows in the “Track Name” column, I am using the “:”.
.loc[]
I will receive the same output with .iloc[] only this time the index of the “Track Name” column needs to be specified:
raw_song.iloc[:,1]
.iloc[]
.loc[] and .iloc[] work the same for rows, except in this case since both the labels and the indices for the rows are the same, they would look exactly the same.
Photo by Annie Spratt on Unsplash
Another easy way to get a portion of the DataFrame is just by using [] and specifying the column name inside of the brackets.
raw_song[['Artist','Streams']].head()
If you use only one column and one set of brackets you will receive a Pandas Series.
raw_song['Streams']
Utilizing what we already know from .loc[] we can use this to add a row or a column to a dataframe. You can also add columns in two other ways using .insert() or by acting as you are adding a slice of a dataframe and specifying the name of the column inside square brackets []. If you are trying to add multiple rows and columns you can create a separate dataframe and join this new dataframe of the new column or row to your original dataframe to add that column or row. To do so you would use pd.merge(), concat(), or join(); however, further discussion and examples of these methods will be discussed in a future article and are outside the scope of this post.
Adding a row:
If you decide to use .loc[] to add a row to a dataframe, you are only able to add it to the bottom of the dataframe. Specifying any other index in the dataframe with erase the data that is currently in that row and replace it with the data that you are inserting. In this example, I just name a new index as ‘last’ and it shows up at the bottom of the dataframe. Notice that it doesn’t have to be a specific name, just as long as it is not the same as any of the other indices.
raw_song.loc['last'] = [0,'hello','bluemen',1,"https://open.spotify.com/track/notarealtrack", '2017-02-05','ec']
You can do the same to add a column to a dataframe using .loc[]. Once again, create a new name for the column unless you are trying to replace a column with new values. You must put either a single value that can be copied for all of the rows, or a list with the same length as the number of rows in the dataframe.
raw_song.loc[:,'new_col'] = 0
raw_song.tail()
.loc[] to add a column to the Spotify dataset. I used 0 to simplify the input.
There are two other ways to insert a new column into your dataframe other than at the end.
The method .insert() will allow you to specify where you would like the column to be placed into your dataframe. It takes 3 arguments, the index where it is to be placed, the name of the new column, and the values to be placed as the column data.
raw_song.insert(2,'new_col',0)
raw_song.tail()
Using .insert() to make a new column in the Spotify dataset.
The second way to add a column to your dataframe is by acting like it is a slice of the dataframe by using the [] to name the new column and have it equal the new data.
raw_song['new_col'] = 0
raw_song.tail()
adding a new column to the end of the dataframe
In this way, I cannot specify the location of the new column, but is another useful way of performing the operation.
Photo by Jan Tinneberg on Unsplash
If you want to get rid of some rows or columns, it is simple, just drop them.
Just specify which axis you would like to drop (0 for rows, 1 for columns) and the name of that row or column you would like to drop, and you are good to go!
raw_song.drop(labels='new_col',axis=1)
.drop() allows for me to get rid of the “new_col” column I added in the previous section.
If you are looking to change the index of your dataframe to a different column in your dataframe then use .set_index() and specify the name of the column in the parentheses. However, if you know exactly what you would like to name your index use the .rename() method.
raw_song.rename(index={0:'first'}).head()
The first index of this dataframe has been changed to “first” from 0
To rename a column specify in the .rename() method the column you want to rename and what you would like to name it inside the {} similar to the renaming the index.
raw_song.rename(columns={'Position':'POSITION_RENAMED'}).head()
The first column has been changed from “Position” to “POSITION_RENAMED”
Many times when you are working with the data in your dataframe you will need to change the data in some way and iterate over all of the values in the dataframe. The easiest way to do so is already built into pandas along with a for loop:
for index, col in raw_song.iterrows():
# manipulate data here
After completing all of the manipulations to your dataframe, it is now time to export it so that you can send it elsewhere. Similar to importing the dataset from a file, it is now just the opposite. Pandas has a variety of different file types that you can write the dataframe to, but the most common is going to be to write it to a csv file.
pd.to_csv('file_name.csv')
Now you know the basics of Pandas and DataFrames. These are extremely powerful tools in your data analysis toolbox.
Also published behind a paywall at https://towardsdatascience.com/an-introduction-to-pandas-29d15a7da6d