How to use Python Seaborn for Exploratory Data Analysis

This is a tutorial on using the seaborn library in Python for Exploratory Data Analysis (EDA).

EDA is another critical process in data analysis (or machine learning/statistical modeling), besides data cleaning (see Data Cleaning in Python: the Ultimate Guide (2020)).

In this guide, you’ll discover (with examples):

- How to use the seaborn Python package to produce useful and beautiful visualizations, including histograms, bar plots, scatter plots, boxplots, and heatmaps.
- How to explore univariate, multivariate numerical and categorical variables with different plots.
- How to discover the relationships among multiple variables.
- Lots more.

Let’s get started!

What is Exploratory Data Analysis (EDA) and Why?

Exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

It is important to explore the data before further analysis or modeling. Within this process, we can get an overview of the insights from the dataset; we can discover trends, patterns, and relationships that are not readily apparent.

What is seaborn?

Seaborn: statistical data visualization is a popular Python library for performing EDA. It is based on matplotlib and provides a high-level interface for drawing attractive and informative statistical graphics.

Within this post, we’ll use a scraped and cleaned YouTube dataset as an example. In our previous article, How to Get MORE YouTube Views with Machine Learning techniques, we made recommendations on how to get more views based on the same dataset.

Before exploring, let’s read the data into Python as dataset df.

    # import packages
    import pandas as pd
    import numpy as np
    import json
    import datetime
    import math
    from datetime import timedelta, datetime
    import matplotlib.pyplot as plt
    import matplotlib.mlab as mlab
    import matplotlib
    plt.style.use('ggplot')
    from matplotlib.pyplot import figure
    %matplotlib inline  # Jupyter notebook magic; omit in a plain script
    matplotlib.rcParams['figure.figsize'] = (12, 8)
    pd.options.mode.chained_assignment = None
    import seaborn as sns

    # read the data
    df = pd.read_pickle('sydney.pkl')

df contains 729 rows and 60 variables. It records different features for each video within Sydney’s YouTube channel, such as:

- views: the number of views of the video
- length: the length of the video/workout in minutes
- calories: the number of calories burned during the workout in the video
- days_since_posted: the number of days since the video was posted until now
- date: the date when the video/workout was posted (Sydney posts one video/workout almost every day)
- workout_type: the type of workout the video was focusing on

Again, you can find more details in How to Get MORE YouTube Views with Machine Learning techniques. We’ll just use this dataset here.

Univariate Analysis: Numerical Variable

First, let’s explore the numerical univariate variables. We create df_numeric to include only the 7 numeric features.

    df_numeric = df.select_dtypes(include='number')
    df_numeric

Histogram: Single Variable

Histograms are one of our favorite plots.

A histogram is an approximate representation of the distribution of numerical data. To construct a histogram, the first step is to “bin” (or “bucket”) the range of values, that is, divide the entire range of values into a series of intervals, and then count how many values fall into each interval.

Seaborn’s distplot function (deprecated since seaborn 0.11 in favor of histplot and displot, but still the API used throughout this post) has options for:

- bins: the bins setting. It’s useful to plot the variable with different bins settings to discover patterns.
  If we don’t set this value, the library will find a useful default for us.
- kde: whether to plot a Gaussian kernel density estimate. This helps to estimate the shape of the probability density function of a continuous random variable. More details can be found on seaborn’s page.
- rug: whether to draw a rug plot on the support axis. This draws a small vertical tick at each observation. It helps to know the exact position of the values for the variable.

Let’s start by looking at a single variable: length, which represents the length of the video.

    sns.distplot(df_numeric['length'], bins=50, kde=True, rug=True)

We can see both the kde line and the rug sticks in the resulting plot. The videos for Sydney’s channel often have a length of 30, 40, or 50 minutes, which presents a multimodal pattern.

Histogram: Multiple Variables

Often, we want to visualize multiple numeric variables and look at them together. We build the function plot_multiple_histograms below to plot histograms for a specific group of variables.

    # this plots multiple seaborn histograms on different subplots.
    def plot_multiple_histograms(df, cols):
        num_plots = len(cols)
        num_cols = math.ceil(np.sqrt(num_plots))
        num_rows = math.ceil(num_plots/num_cols)

        fig, axs = plt.subplots(num_rows, num_cols)

        for ind, col in enumerate(cols):
            i = math.floor(ind/num_cols)
            j = ind - i*num_cols

            if num_rows == 1:
                if num_cols == 1:
                    sns.distplot(df[col], kde=True, ax=axs)
                else:
                    sns.distplot(df[col], kde=True, ax=axs[j])
            else:
                sns.distplot(df[col], kde=True, ax=axs[i, j])

    plot_multiple_histograms(df, ['length', 'views', 'calories', 'days_since_posted'])

We can see that different variables show different shapes of distributions, outliers, skewness, etc.

Univariate Analysis: Categorical Variables

Next, let’s look at categorical univariate variables.

Bar Chart: Single Variable

The bar chart (or countplot in seaborn) is the categorical variables’ version of the histogram.
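
To make the analogy with the histogram concrete: where a histogram counts values per bin, a count plot counts rows per category. Here is a minimal sketch of that tally in plain Python. The area labels are made up for illustration and are not taken from the dataset.

```python
from collections import Counter

# Hypothetical body-area labels, made up for illustration.
areas = ["full", "ab", "full", "butt", "full", "ab"]

# Tally how many times each category occurs.
counts = Counter(areas)

# Print a crude text bar chart, most frequent category first.
for label, n in counts.most_common():
    print(f"{label:<5} {'#' * n} ({n})")
```

Each bar in a count plot has a height equal to one of these tallies; seaborn simply does the counting and the drawing in one call.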

A bar chart or bar plot is a chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. A bar graph shows comparisons among discrete categories.

First, let’s select the categorical (non-numeric) variables. We plot the bar chart for the variable area, which represents the body areas that the workout video is focusing on.

    # select non-numeric variables
    df_non_numeric = df.select_dtypes(exclude='number')

    plt.figure(figsize=(25, 7))
    sns.countplot(x="area", data=df_non_numeric)

There are many areas that the videos targeted. It is hard to read without zooming in. Still, we can see that more than half (over 400) of these videos focused on the “full” body area, and the second most popular area is “ab”.

Bar Chart: Multiple Variables

Also, we create a function plot_multiple_countplots to plot the bar charts of multiple variables at once.

We use it to plot some indicator variables below.

- The is_{}_area variables are indicators for different body areas. For example, is_butt_area == True when the workout focuses on the butt, otherwise it is False.
- The is_{}_workout variables are indicators for different workout types. For example, is_strength_workout == True when the workout focuses on strength, otherwise it is False.

    # this plots multiple seaborn countplots on different subplots.
    def plot_multiple_countplots(df, cols):
        num_plots = len(cols)
        num_cols = math.ceil(np.sqrt(num_plots))
        num_rows = math.ceil(num_plots/num_cols)

        fig, axs = plt.subplots(num_rows, num_cols)

        for ind, col in enumerate(cols):
            i = math.floor(ind/num_cols)
            j = ind - i*num_cols

            if num_rows == 1:
                if num_cols == 1:
                    sns.countplot(x=df[col], ax=axs)
                else:
                    sns.countplot(x=df[col], ax=axs[j])
            else:
                sns.countplot(x=df[col], ax=axs[i, j])

    plot_multiple_countplots(df_non_numeric, ['is_butt_area', 'is_upper_area', 'is_cardio_workout', 'is_strength_workout'])

Multivariate Analysis

After exploring the variables one-by-one, let’s look at multiple variables together.

Different plots can be used to explore relationships among different combinations of variables. In the last section, you can also find a modeling approach for testing relationships among multiple variables.

Scatter Plot: Two Numerical Variables

First, let’s see how we can discover the relationship between two numerical variables.

What if we want to know how the workout length impacts the number of views? We can use scatterplots (relplot) to answer the question.

A scatter plot uses Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (color/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

    sns.relplot(x='length', y='views', data=df, aspect=2.0)

We can see that the more popular videos tend to have lengths between 30 and 40 minutes.

Bar Chart: Two Categorical Variables

What if we want to know the relationship between two categorical variables?

Let’s visualize the most common 6 areas (area2) and the most common 4 workout types (workout_type2) within the videos.

    # keep the 5 most frequent areas; label everything else 'Other'
    top6 = list(df['area'].value_counts().index[:5])
    df['area2'] = df['area']
    msk = df['area2'].isin(top6)
    df.loc[~msk, 'area2'] = 'Other'

    # keep the 3 most frequent workout types; label everything else 'Other'
    top4 = list(df['workout_type'].value_counts().index[:3])
    df['workout_type2'] = df['workout_type']
    msk = df['workout_type2'].isin(top4)
    df.loc[~msk, 'workout_type2'] = 'Other'

    # order the columns from highest count to lowest.
    order = df['area2'].value_counts().index
    sns.catplot(x="workout_type2", col='area2', col_order=order, kind="count", data=df, aspect=0.5)

We can see that “full” body “strength” workouts are the most common within the videos.

Boxplot: Numerical and Categorical Variables

Box plots are useful visualizations when comparing groups of categories together.

A box plot (box-and-whisker plot) is a standardized way of displaying the dataset based on a five-number summary: the minimum, the maximum, the sample median, and the first and third quartiles.

We can use side-by-side boxplots to compare a numeric variable among categories of a categorical variable.

Do Sydney’s videos get more views on certain days of the week? Let’s plot day_of_week and views.

    to_replace = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
    df['day_of_week_num'] = df['date'].dt.dayofweek
    df['day_of_week'] = df['day_of_week_num'].replace(to_replace=to_replace)

    order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
    sns.boxplot(x="day_of_week", y="views", data=df, order=order)

This is interesting but hard to see due to outliers. Let’s remove them by keeping only the videos with fewer than 400,000 views.

    msk = df['views'] < 400000
    sns.boxplot(x="day_of_week", y="views", data=df[msk], order=order)

We can see that Monday videos tend to have more views than videos posted on other days, while Sunday videos get the fewest views.

Swarmplot: Numerical and Categorical Variables

Another way of looking at the same question is with a swarm plot.

A swarm plot is a categorical scatterplot where the points are adjusted (only along the categorical axis) so that they don’t overlap. This gives a better representation of the distribution of values.

A swarm plot is a good complement to a box plot when we want to show all observations along with some representation of the underlying distribution.

    sns.swarmplot(x="day_of_week", y="views", data=df[msk], order=order)

A swarm plot would have too many dots for larger datasets, but it’s good here with a smaller dataset.

Boxplot Group: Numerical and Categorical Variables

Are the views on certain days of the week higher for certain workout types?

To answer this question, two categorical variables (workout_type, day_of_week) and one numerical variable (views) are involved. Let’s see how we can visualize the answer to this question.

We can use a panel boxplot (catplot) to visualize the three variables together.

The catplot is useful to show the relationship between a numerical and one or more categorical variables using one of several visual representations.

    sns.catplot(x="workout_type", y="views", col="day_of_week", aspect=.6, kind="box", data=df[msk], col_order=order);

That’s quite messy with too many categories of workout_type.

Based on the distribution of workout_type, we group the categories other than “strength”, “hiit”, “stretch”, “cardio” together as ‘Other’.

    df['workout_type'].value_counts()

    top4 = list(df['workout_type'].value_counts().index[:3])
    df['workout_type2'] = df['workout_type']
    msk = df['workout_type2'].isin(top4)
    df.loc[~msk, 'workout_type2'] = 'Other'

Also, we remove the outliers to make the plot even clearer.

    msk = df['views'] < 400000
    sns.catplot(x="workout_type2", y="views", col="day_of_week", kind="box", data=df[msk], col_order=order, aspect=0.5)

We can notice things such as:

- “stretch” workouts are only posted on Sundays.
- “hiit” workouts seem to have more views on Mondays.
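
The grouping step used in this section (keep the most frequent categories, relabel the rest as ‘Other’) can be sketched in plain Python as follows. This is an illustrative sketch, not the code from the post: group_rare is a hypothetical helper name, and the workout labels and counts are made up.

```python
from collections import Counter

def group_rare(values, top_n, other="Other"):
    """Keep the top_n most frequent labels; relabel everything else as `other`."""
    keep = {label for label, _ in Counter(values).most_common(top_n)}
    return [v if v in keep else other for v in values]

# Hypothetical workout types, made up for illustration.
workout_types = ["strength"] * 5 + ["hiit"] * 4 + ["cardio"] * 3 + ["yoga", "pilates"]

# Keep the 3 most frequent labels; "yoga" and "pilates" become "Other".
grouped = group_rare(workout_types, top_n=3)
print(Counter(grouped))
```

Collapsing rare levels this way keeps a panel plot to a handful of boxes per panel. Note that when two labels tie at the cutoff, which one is kept depends on the counting implementation, so it is worth checking the counts before choosing top_n.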

Heatmap: Numerical and Categorical Variables

We can also use pivot tables and heatmaps to visualize multiple variables.

A heat map is a data visualization technique that shows the magnitude of a phenomenon as color in two dimensions. The variation in color may be by hue or intensity, giving obvious visual cues to the reader about how the phenomenon is clustered or varies over space.

For example, the heatmap below has area and workout_type categories as the axes; the color scale represents the number of videos in each cell (the code counts the non-missing views values per combination).

    df_area_workout = df.groupby(['area', 'workout_type'])['views'].count().reset_index()
    df_area_workout_pivot = df_area_workout.pivot(index='area', columns='workout_type', values='views').fillna(0)
    sns.heatmap(df_area_workout_pivot, annot=True, fmt='.0f', cmap="YlGnBu")

(Advanced) Relationship Test and Scatterplot: Numerical and Categorical Variables

How do we automatically discover the relationships among multiple variables?

Let’s take the most critical features below and see how we could find interesting relationships.

    # group of critical features selected
    cols = ['length', 'views', 'calories', 'days_since_posted', 'area', 'workout_type', 'day_of_week']
    df_test = df[cols].copy()  # copy so that later column assignments don't touch df
    df_test.head()

    numeric_columns = set(df_test.select_dtypes(include=['number']).columns)
    non_numeric_columns = set(df_test.columns) - numeric_columns
    print(numeric_columns)
    print(non_numeric_columns)

We have 4 numerical variables and 3 categorical variables. There could be many complicated relationships among them!

In this section, we use the same method to test for relationships (including multicollinearity) among them as in How to Get MORE YouTube Views with Machine Learning techniques.

At a high level, we use K-fold cross-validation to achieve this.

First, we transform the categorical variables. Since we will be using 5-fold cross-validation, we need to make sure there are at least 5 observations for each category level.

    for c in non_numeric_columns:
        cnt = df_test[c].value_counts()
        small_cnts = list(cnt[cnt < 5].index)

        # map every level with fewer than 5 observations to 'other'
        s_replace = {}
        for sm in small_cnts:
            s_replace[sm] = 'other'

        df_test[c] = df_test[c].replace(s_replace)
        df_test[c] = df_test[c].fillna('other')

Next, we loop through each variable and fit a model to predict it using the other variables. We use a simple Gradient Boosting Model (GBM) and K-fold validation.

Depending on whether the target variable is numerical or categorical, we apply different models and scores (model predictive power evaluation metrics).

When the target is numerical, we use the Gradient Boosting Regressor model and Root Mean Squared Error (RMSE); when the target is categorical, we use the Gradient Boosting Classifier model and Accuracy.

For each target, we print out the K-fold validation score (average of the scores) and the most important 5 predictors.

We also add three features rand0, rand1, rand2 composed of random numbers. They serve as anchors when comparing the relationship between variables. If one predictor is less important than, or about as important as, these random variables, then it is not an important predictor of the target variable.

    from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier
    from sklearn.model_selection import cross_val_score

    # we are going to look at feature importances so we like putting random features to act as a benchmark.
    df_test['rand0'] = np.random.rand(df_test.shape[0])
    df_test['rand1'] = np.random.rand(df_test.shape[0])
    df_test['rand2'] = np.random.rand(df_test.shape[0])

    # testing for relationships.
    # The original post passed loss='ls' (regressor) and loss='deviance' (classifier).
    # Recent scikit-learn releases no longer accept those names, so the loss is left
    # at its default, which is the same squared-error / log-loss objective.
    # for numeric targets.
    reg = GradientBoostingRegressor(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=1)
    # for categorical targets.
    clf = GradientBoostingClassifier(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=1)

    df_test['calories'] = df_test['calories'].fillna(0)  # only calories should have missing values.

    # try to predict one feature using the rest of others to test collinearity, so it's easier to interpret the results
    for c in cols:
        # c is the thing to predict.
        if c not in ['rand0', 'rand1', 'rand2']:

            X = df_test.drop([c], axis=1)  # drop the thing to predict.
            X = pd.get_dummies(X)
            y = df_test[c]

            print(c)

            if c in non_numeric_columns:
                scoring = 'accuracy'
                model = clf
                scores = cross_val_score(clf, X, y, cv=5, scoring=scoring)
                print(scoring + ": %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
            elif c in numeric_columns:
                scoring = 'neg_root_mean_squared_error'
                model = reg
                scores = cross_val_score(reg, X, y, cv=5, scoring=scoring)
                print(scoring.replace('neg_', '') + ": %0.2f (+/- %0.2f)" % (-scores.mean(), scores.std() * 2))
            else:
                print('what is this?')
                continue  # no model was chosen, so skip the importance step

            model.fit(X, y)
            df_importances = pd.DataFrame(data={'feature_name': X.columns, 'importance': model.feature_importances_}).sort_values(by='importance', ascending=False)
            top5_features = df_importances.iloc[:5]
            print('top 5 features:')
            print(top5_features)

            print()

From the results above, we can look into each of the target variables and their relationship with the predictors.

Again, the step-by-step procedure of this test can be found in the Test for Multicollinearity section in How to Get MORE YouTube Views with Machine Learning techniques.

We can see that there is a strong relationship between length and calories.

Let’s use a scatter plot to visualize them: the x-axis as length and the y-axis as calories, while the size of the dots represents the views.

    # length vs. calories, with dot size showing views
    sns.relplot(x='length', y='calories', size='views', sizes=(10, 1000), data=df, aspect=3.0)

We can see that the longer the video, the more calories are burned, which is intuitive. We can also see that the videos with more views tend to have a shorter length.

Related articles:

- How to Get MORE YouTube Views with Machine Learning techniques. This previous post used the same dataset. It contains details of how we scraped and transformed the original dataset.
- Data Cleaning in Python: the Ultimate Guide (2020). This article covers what to clean and techniques to clean missing data, outliers, duplicates, inconsistent data, etc.

Previously published at https://www.justintodata.com/how-to-use-python-seaborn-for-exploratory-data-analysis/