This is a tutorial on using the seaborn library in Python for Exploratory Data Analysis (EDA).
EDA is another critical step in data analysis (or machine learning/statistical modeling), alongside data cleaning, which we covered in Data Cleaning in Python: the Ultimate Guide (2020).
In this guide, you’ll discover (with examples) how to use seaborn for EDA: from histograms, count plots, and scatter plots to box plots, swarm plots, catplots, and heatmaps, plus a modeling approach for testing relationships among multiple variables.
Let’s get started!
Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods.
A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
It is important to explore the data before further analysis or modeling. Within this process, we can get an overview of the insights from the dataset; we can discover trends, patterns, and relationships that are not readily apparent.
Seaborn is a popular Python library for performing EDA.
It is based on matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.
Within this post, we’ll use a scraped and cleaned YouTube dataset as an example.
In our previous article How to Get MORE YouTube Views with Machine Learning techniques, we made recommendations on how to get more views based on the same dataset.
Before exploring, let’s read the data into Python as the DataFrame df.
# import packages
import math
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
plt.style.use('ggplot')
matplotlib.rcParams['figure.figsize'] = (12, 8)   # default figure size
pd.options.mode.chained_assignment = None         # silence SettingWithCopyWarning

# read the data
df = pd.read_pickle('sydney.pkl')
df contains 729 rows and 60 variables. It records different features for each video on Sydney’s YouTube channel, such as length, views, calories, days_since_posted, area, and workout_type.
Again, you can find more details in How to Get MORE YouTube Views with Machine Learning techniques. We’ll just use this dataset here.
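To confirm the dimensions and preview the records, a quick check like the following works (a minimal sketch; df.head() prints the first five rows):

# quick overview: dimensions and a preview of the records
print(df.shape)  # (729, 60)
df.head()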
First, let’s explore the numerical univariate variables.
We create df_numeric to include only the 7 numeric features.
df_numeric = df.select_dtypes(include='number')
df_numeric
Histograms are one of our favorite plots.
A histogram is an approximate representation of the distribution of numerical data.
To construct a histogram, the first step is to “bin” (or “bucket”) the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval.
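To make the binning step concrete, here is a minimal sketch using numpy (the values are made up for illustration):

# minimal illustration of binning: np.histogram returns the counts per bin
import numpy as np

values = np.array([1, 3, 5, 7, 8, 9, 21, 25, 30])
counts, bin_edges = np.histogram(values, bins=3)  # 3 equal-width bins over [1, 30]
print(counts)     # [6 0 3]: how many values fall into each interval
print(bin_edges)  # [ 1.          10.66666667  20.33333333  30.        ]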
Seaborn’s function distplot has options for bins, a kernel density estimate (kde), and a rug plot, among others.
Let’s start by looking at a single variable: length, which represents the length of the video.
sns.distplot(df_numeric['length'], bins=50, kde=True, rug=True)
We can see both the kde line and the rug sticks in the plot below.
The videos on Sydney’s channel often have lengths around 30, 40, or 50 minutes, which produces a multimodal pattern.
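Note that seaborn 0.11+ deprecates distplot; assuming the same df_numeric, a roughly equivalent plot uses histplot plus rugplot:

# histplot/rugplot replacement for the deprecated distplot (seaborn >= 0.11)
sns.histplot(x=df_numeric['length'], bins=50, kde=True)
sns.rugplot(x=df_numeric['length'])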
Often, we want to visualize multiple numeric variables and look at them together.
We build the function plot_multiple_histograms below to plot histograms for a specific group of variables.
# this plots multiple seaborn histograms on different subplots.
def plot_multiple_histograms(df, cols):
    num_plots = len(cols)
    num_cols = math.ceil(np.sqrt(num_plots))    # grid width: roughly square layout
    num_rows = math.ceil(num_plots / num_cols)  # grid height
    fig, axs = plt.subplots(num_rows, num_cols)
    for ind, col in enumerate(cols):
        i = math.floor(ind / num_cols)  # row index of this subplot
        j = ind - i * num_cols          # column index of this subplot
        # plt.subplots returns a scalar, 1-D, or 2-D axs depending on the grid shape
        if num_rows == 1:
            if num_cols == 1:
                sns.distplot(df[col], kde=True, ax=axs)
            else:
                sns.distplot(df[col], kde=True, ax=axs[j])
        else:
            sns.distplot(df[col], kde=True, ax=axs[i, j])

plot_multiple_histograms(df, ['length', 'views', 'calories', 'days_since_posted'])
We can see that different variables show different shapes of distributions, outliers, skewness, etc.
Next, let’s look at categorical univariate variables.
The bar chart (or countplot in seaborn) is the categorical-variable counterpart of the histogram.
A bar chart or bar plot is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent.
A bar graph shows comparisons among discrete categories.
First, let’s select the categorical (non-numeric) variables.
We plot the bar chart for the variable area, which represents the body areas that the workout video is focusing on.
# select non-numeric variables
df_non_numeric = df.select_dtypes(exclude='number')

plt.figure(figsize=(25, 7))
sns.countplot(x="area", data=df_non_numeric)
The videos target many different areas, so the chart is hard to read without zooming in. Still, we can see that more than half (over 400) of the videos focus on the “full” body area, and the second most common focus is “ab”.
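With this many categories, rotating the x tick labels is a common fix for legibility (a sketch):

# rotate the x tick labels so the many area categories stay readable
plt.figure(figsize=(25, 7))
sns.countplot(x="area", data=df_non_numeric)
plt.xticks(rotation=90)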
Also, we create a function plot_multiple_countplots to plot the bar charts of multiple variables at once.
We use it to plot some indicator variables below.
The is_{}_area columns are indicator variables for different body areas. For example, is_butt_area == True when the workout focuses on the butt; otherwise it is False. Likewise, the is_{}_workout columns are indicator variables for different workout types. For example, is_strength_workout == True when the workout focuses on strength; otherwise it is False.

# this plots multiple seaborn countplots on different subplots.
def plot_multiple_countplots(df, cols):
    num_plots = len(cols)
    num_cols = math.ceil(np.sqrt(num_plots))
    num_rows = math.ceil(num_plots / num_cols)
    fig, axs = plt.subplots(num_rows, num_cols)
    for ind, col in enumerate(cols):
        i = math.floor(ind / num_cols)
        j = ind - i * num_cols
        if num_rows == 1:
            if num_cols == 1:
                sns.countplot(x=df[col], ax=axs)
            else:
                sns.countplot(x=df[col], ax=axs[j])
        else:
            sns.countplot(x=df[col], ax=axs[i, j])

plot_multiple_countplots(df_non_numeric, ['is_butt_area', 'is_upper_area', 'is_cardio_workout', 'is_strength_workout'])
After exploring the variables one-by-one, let’s look at multiple variables together.
Different plots can be used to explore relationships among different combinations of variables.
In the last section, you can also find a modeling approach for testing relationships among multiple variables.
First, let’s see how we can discover the relationship between two numerical variables.
What if we want to know how the workout length impacts the number of views?
We can use scatterplots (relplot) to answer the question.
A scatter plot uses Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed.
The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
sns.relplot(x='length', y='views', data=df, aspect=2.0)
We can see that the more popular videos tend to have lengths between 30 and 40 minutes.
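As mentioned above, coding the points displays one additional variable; for example, coloring the dots by workout_type (a sketch; any categorical column would work):

# color-code the points by a third, categorical variable
sns.relplot(x='length', y='views', hue='workout_type', data=df, aspect=2.0)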
What if we want to know the relationship between two categorical variables?
Let’s visualize the most common 6 areas (area2) and the most common 4 workout types (workout_type2) within the videos.
top6 = list(df['area'].value_counts().index[:6])  # keep the 6 most common areas
df['area2'] = df['area']
msk = df['area2'].isin(top6)
df.loc[~msk, 'area2'] = 'Other'  # lump the less common areas together

top4 = list(df['workout_type'].value_counts().index[:4])  # keep the 4 most common workout types
df['workout_type2'] = df['workout_type']
msk = df['workout_type2'].isin(top4)
df.loc[~msk, 'workout_type2'] = 'Other'  # lump the less common workout types together

order = df['area2'].value_counts().index  # order the columns from highest count to lowest.
sns.catplot(x="workout_type2",
            col='area2',
            col_order=order,
            kind="count", data=df,
            aspect=0.5)
We can see that “full” body “strength” workouts are the most common within the videos.
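Since this top-N-plus-“Other” grouping recurs later in the post, a small helper could replace both snippets (a sketch; top_n_with_other is our own hypothetical name, not a pandas function):

# hypothetical helper: keep a Series' n most common values, lump the rest into `other`
def top_n_with_other(s, n, other='Other'):
    top = s.value_counts().index[:n]
    return s.where(s.isin(top), other)

df['area2'] = top_n_with_other(df['area'], 6)
df['workout_type2'] = top_n_with_other(df['workout_type'], 4)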
Box plots are useful visualizations when comparing groups of categories together.
A box plot (box-and-whisker plot) is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.
We can use side-by-side box plots to compare a numeric variable across the categories of a categorical variable.
Do Sydney’s videos get more views on certain days of the week?
Let’s plot day_of_week and views.
to_replace = {0:'Monday', 1:'Tuesday', 2:'Wednesday', 3:'Thursday', 4:'Friday', 5:'Saturday', 6:'Sunday'}
df['day_of_week_num'] = df['date'].dt.dayofweek
df['day_of_week'] = df['day_of_week_num'].replace(to_replace=to_replace)
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.boxplot(x="day_of_week", y="views", data=df, order=order)
This is interesting but hard to see due to outliers. Let’s remove them.
msk = df['views'] < 400000
sns.boxplot(x="day_of_week", y="views", data=df[msk], order=order)
We can see that Monday videos tend to get more views than videos posted on other days, while Sunday videos get the fewest.
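The 400,000 cutoff above is eyeballed from the first plot; a quantile-based threshold is a less manual alternative (a sketch):

# keep rows below the 95th percentile of views instead of a hard-coded cutoff
msk = df['views'] < df['views'].quantile(0.95)
sns.boxplot(x="day_of_week", y="views", data=df[msk], order=order)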
Another way of looking at the same question is with a swarm plot.
A swarm plot is a categorical scatterplot where the points are adjusted (only along the categorical axis) so that they don’t overlap.
This gives a better representation of the distribution of values.
A swarm plot is a good complement to a box plot when we want to show all observations along with some representation of the underlying distribution.
sns.swarmplot(x="day_of_week", y="views", data=df[msk], order=order)
A swarm plot would have too many dots for a larger dataset, but it works well on a smaller one like ours.
Are the views on certain days of the week higher for certain workout types?
To answer this question, two categorical variables (workout_type, day_of_week) and one numerical variable (views) are involved. Let’s see how we can visualize the answer.
We can use a panel boxplot (catplot) to visualize the three variables together.
The catplot is useful to show the relationship between a numerical and one or more categorical variables using one of several visual representations.
sns.catplot(x="workout_type", y="views",
col="day_of_week", aspect=.6,
kind="box", data=df[msk], col_order=order);
That’s quite messy with too many categories of workout_type.
Based on the distribution of workout_type, we group the categories other than “strength”, “hiit”, “stretch”, and “cardio” together as “Other”.
df['workout_type'].value_counts()

top4 = list(df['workout_type'].value_counts().index[:4])  # strength, hiit, stretch, cardio
df['workout_type2'] = df['workout_type']
msk = df['workout_type2'].isin(top4)
df.loc[~msk, 'workout_type2'] = 'Other'
Also, we remove the outliers to make the plot clearer.
msk = df['views'] < 400000

sns.catplot(x="workout_type2", y="views",
            col="day_of_week",
            kind="box", data=df[msk], col_order=order,
            aspect=0.5)
From this panel of box plots, we can compare how views vary across workout types and days of the week.
We can also use pivot tables and heatmaps to visualize multiple variables.
A heat map is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions.
The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.
For example, the heatmap below has area and workout_type categories as the axes; the color scale represents the number of videos in each cell.
# count the number of videos for each (area, workout_type) combination
df_area_workout = df.groupby(['area', 'workout_type'])['views'].count().reset_index()
df_area_workout_pivot = df_area_workout.pivot(index='area', columns='workout_type', values='views').fillna(0)
sns.heatmap(df_area_workout_pivot, annot=True, fmt='.0f', cmap="YlGnBu")
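As an aside, the groupby-plus-pivot steps can be collapsed into a single pivot_table call, which is equivalent here since aggfunc='count' also counts the non-null views per cell:

# one-step alternative: pivot_table with a count aggregation
df_area_workout_pivot = df.pivot_table(index='area', columns='workout_type',
                                       values='views', aggfunc='count', fill_value=0)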
How do we automatically discover the relationships among multiple variables?
Let’s take the most critical features below and see how we could find interesting relationships.
# group of critical features selected
cols = ['length', 'views', 'calories', 'days_since_posted', 'area', 'workout_type', 'day_of_week']
df_test = df[cols]
df_test.head()
numeric_columns = set(df_test.select_dtypes(include=['number']).columns)
non_numeric_columns = set(df_test.columns) - numeric_columns
print(numeric_columns)
print(non_numeric_columns)
We have 4 numerical variables and 3 categorical variables.
There could be many complicated relationships among them!
In this section, we use the same method to test for relationships (including multicollinearity) among them as in How to Get MORE YouTube Views with Machine Learning techniques.
At a high level, we use K-fold cross-validation to achieve this.
First, we transform the categorical variables. Since we will be using 5-fold cross-validation, we need to make sure there are at least 5 observations for each category level.
for c in non_numeric_columns:
    cnt = df_test[c].value_counts()
    small_cnts = list(cnt[cnt < 5].index)  # levels with fewer than 5 observations
    s_replace = {}
    for sm in small_cnts:
        s_replace[sm] = 'other'
    df_test[c] = df_test[c].replace(s_replace)  # lump the rare levels into 'other'
    df_test[c] = df_test[c].fillna('other')     # treat missing values as 'other' too
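To double-check the grouping, we can print the level counts again; note that 'other' itself can still end up with fewer than 5 observations if only a handful of rows were lumped into it:

# sanity check: inspect the remaining level counts of each categorical column
for c in non_numeric_columns:
    print(df_test[c].value_counts())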
Next, we loop through each variable and fit a model to predict it using the other variables. We use a simple Gradient Boosting Machine (GBM) with K-fold cross-validation.
Depending on whether the target variable is numerical or categorical, we apply different models and scores (model predictive power evaluation metrics).
When the target is numerical, we use the Gradient Boosting Regressor model and Root Mean Squared Error (RMSE); when the target is categorical, we use the Gradient Boosting Classifier model and Accuracy.
For each target, we print out the K-fold validation score (average of the scores) and the most important 5 predictors.
We also add three features rand0, rand1, rand2 composed of random numbers. They serve as benchmarks when evaluating predictors: if a predictor is no more important than these random features, it is not an important predictor of the target variable.
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# we are going to look at feature importances, so we add random features to act as a benchmark.
df_test['rand0'] = np.random.rand(df_test.shape[0])
df_test['rand1'] = np.random.rand(df_test.shape[0])
df_test['rand2'] = np.random.rand(df_test.shape[0])

# testing for relationships.
# for numeric targets (note: scikit-learn >= 1.0 renames loss='ls' to loss='squared_error').
reg = GradientBoostingRegressor(n_estimators=100, max_depth=5,
                                learning_rate=0.1, loss='ls',
                                random_state=1)
# for categorical targets (note: scikit-learn >= 1.1 renames loss='deviance' to loss='log_loss').
clf = GradientBoostingClassifier(n_estimators=100, max_depth=5,
                                 learning_rate=0.1, loss='deviance',
                                 random_state=1)

df_test['calories'] = df_test['calories'].fillna(0)  # only calories has missing values.

# try to predict each feature using the rest, to test for relationships (multicollinearity),
# which makes the results easier to interpret.
for c in cols:
    # c is the target to predict.
    if c not in ['rand0', 'rand1', 'rand2']:
        X = df_test.drop([c], axis=1)  # drop the target.
        X = pd.get_dummies(X)          # one-hot encode the categorical predictors.
        y = df_test[c]
        print(c)
        if c in non_numeric_columns:
            scoring = 'accuracy'
            model = clf
            scores = cross_val_score(clf, X, y, cv=5, scoring=scoring)
            print(scoring + ": %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
        elif c in numeric_columns:
            scoring = 'neg_root_mean_squared_error'
            model = reg
            scores = cross_val_score(reg, X, y, cv=5, scoring=scoring)
            print(scoring.replace('neg_', '') + ": %0.2f (+/- %0.2f)" % (-scores.mean(), scores.std() * 2))
        else:
            print('what is this?')
        model.fit(X, y)
        df_importances = pd.DataFrame(data={'feature_name': X.columns,
                                            'importance': model.feature_importances_})
        df_importances = df_importances.sort_values(by='importance', ascending=False)
        top5_features = df_importances.iloc[:5]
        print('top 5 features:')
        print(top5_features)
        print()
From the results above, we can look into each of the target variables and their relationship with the predictors.
Again, the step-by-step procedure of this test can be found in the Test for Multicollinearity section in How to Get MORE YouTube Views with Machine Learning techniques.
We can see that there is a strong relationship between length and calories.
Let’s use a scatter plot to visualize them: the x-axis as length and the y-axis as calories, while the size of the dots represents the views.

# length vs. calories, with dot size showing views
sns.relplot(x='length', y='calories', size='views',
            sizes=(10, 1000), data=df, aspect=3.0)
We can see that the longer the video, the more calories are burned, which is intuitive. We can also see that the videos with more views tend to have a shorter length.
Related articles:
How to Get MORE YouTube Views with Machine Learning techniques
This previous post used the same dataset. It contains details of how we scraped and transformed the original dataset.
Data Cleaning in Python: the Ultimate Guide (2020)
This article covers what to clean and techniques to clean missing data, outliers, duplicates, inconsistent data, etc.
Thank you for reading!
Leave a comment if you have any questions. We’ll try our best to answer.
Before you leave, don’t forget to sign up for the Just into Data newsletter, or connect with us on Twitter and Facebook, so you won’t miss any new data science articles from us.
Previously published at https://www.justintodata.com/how-to-use-python-seaborn-for-exploratory-data-analysis/