Data visualization is a form of visual communication. It involves the creation and study of the visual representation of data.
We'll be implementing various data visualization techniques on the 'iris' dataset.
Let us look at some of the plots used in data visualization one by one.
First, we need to import two important libraries for data visualization: Matplotlib and Seaborn.
Matplotlib is a Python library used extensively for data visualization, while Seaborn is a Python library built on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.
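import pandas as pd  # needed for read_csv below
import matplotlib.pyplot as plt
import seaborn as sns
# Load the dataset; the code below expects the columns sepal_length,
# sepal_width, petal_length, petal_width and species
iris = pd.read_csv("iris.csv")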
A scatter plot is one of the most commonly used plots for simple data visualization. It shows where each point in the dataset lies with respect to any two or three features (columns). Scatter plots can be drawn in 2D as well as 3D.
# Here we are plotting sepal_length vs sepal_width
# setosa - 'red'; versicolor - 'blue'; virginica - 'green'
colors = {'setosa': 'red', 'versicolor': 'blue', 'virginica': 'green'}
for species, color in colors.items():
    # Select the rows of one species and draw them in a single call
    subset = iris[iris['species'] == species]
    plt.scatter(subset['sepal_length'], subset['sepal_width'], color = color, label = species)
# Axis labels only need to be set once, after all points are drawn
plt.xlabel('sepal_length')
plt.ylabel('sepal_width')
plt.legend()
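Scatter plots can also be drawn in 3D by adding a third feature. The following is a minimal sketch, not the only way to do it: it assumes matplotlib's built-in 3D projection and uses a synthetic stand-in for iris.csv (made-up values with the same column names) so that it runs on its own. The choice of petal_length as the third axis is an assumption for illustration.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for iris.csv: made-up values, same column names.
# Replace with pd.read_csv("iris.csv") to plot the real data.
rng = np.random.default_rng(0)
iris = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 150),
    "sepal_width": rng.normal(3.0, 0.4, 150),
    "petal_length": rng.normal(3.8, 1.7, 150),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 50),
})

colors = {"setosa": "red", "versicolor": "blue", "virginica": "green"}

fig = plt.figure()
ax = fig.add_subplot(projection="3d")  # 3D axes from matplotlib's mplot3d toolkit
for species, color in colors.items():
    subset = iris[iris["species"] == species]
    # Three coordinates per point: x, y and z
    ax.scatter(subset["sepal_length"], subset["sepal_width"],
               subset["petal_length"], color=color, label=species)
ax.set_xlabel("sepal_length")
ax.set_ylabel("sepal_width")
ax.set_zlabel("petal_length")
ax.legend()
# Call plt.show() to display the figure when running as a script.
```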
Let's say we have n features in a dataset. A pair plot creates an (n x n) grid of plots in which each diagonal plot is a histogram of the feature corresponding to that row, and every other plot is a scatter plot with the row's feature on the y axis and the column's feature on the x axis.
A pair plot of the Iris dataset can be drawn with seaborn's pairplot function.
A box plot (or box-and-whisker plot) shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. The box shows the quartiles of the dataset while the whiskers extend to show the rest of the distribution.
Code for plotting the features using box plots:
# Plotting the features using boxes
plt.style.use('ggplot')
plt.subplot(2,2,1)
sns.boxplot(x = 'species', y = 'sepal_length', data = iris)
plt.subplot(2,2,2)
sns.boxplot(x = 'species', y = 'sepal_width', data = iris)
plt.subplot(2,2,3)
sns.boxplot(x = 'species', y = 'petal_length', data = iris)
plt.subplot(2,2,4)
sns.boxplot(x = 'species', y = 'petal_width', data = iris)
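To make the box and whiskers concrete, here is a minimal sketch of the numbers behind a box plot. It assumes the common convention, which is also seaborn's default, that whiskers reach the most extreme points within 1.5 times the interquartile range of the box and that anything beyond is drawn as an outlier. The helper name `box_stats` and the sample values are hypothetical, chosen only for illustration.

```python
import pandas as pd

def box_stats(values, whis=1.5):
    """Return the quartiles, whisker ends and outliers a box plot would show."""
    s = pd.Series(values, dtype=float)
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond these fences are treated as outliers
    low_fence, high_fence = q1 - whis * iqr, q3 + whis * iqr
    inside = s[(s >= low_fence) & (s <= high_fence)]
    outliers = s[(s < low_fence) | (s > high_fence)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside.min(),    # lowest point still inside the fence
        "whisker_high": inside.max(),   # highest point still inside the fence
        "outliers": sorted(outliers),
    }

# Made-up sample with one unusually large value
stats = box_stats([4.3, 4.8, 5.0, 5.0, 5.1, 5.4, 5.8, 7.9])
print(stats)
```

In this sample the value 7.9 lies above the upper fence, so it is reported as an outlier and the upper whisker stops at 5.8.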
A violin plot can be read as a combination of a box plot in the middle and a distribution plot (kernel density estimate) on both sides of the data. This reveals details of the distribution such as whether it is multimodal, how skewed it is, and so on.
The violin plot also comes from the seaborn package. The code is simple and is as follows.
# Representing data using violin form
plt.style.use('ggplot')
plt.subplot(2,2,1)
sns.violinplot(x = 'species', y = 'sepal_length', data = iris)
plt.subplot(2,2,2)
sns.violinplot(x = 'species', y = 'sepal_width', data = iris)
plt.subplot(2,2,3)
sns.violinplot(x = 'species', y = 'petal_length', data = iris)
plt.subplot(2,2,4)
sns.violinplot(x = 'species', y = 'petal_width', data = iris)
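To illustrate the point about multimodality, the following sketch draws a box plot and a violin plot of the same pooled feature side by side. It uses a synthetic stand-in for the data (made-up values, deliberately built with one low cluster and two higher ones), so the shape seen here is an assumption of the example and not a result from iris.csv.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in: petal_length drawn from three clusters, so the
# pooled values are clearly multimodal. Values are made up.
rng = np.random.default_rng(0)
iris = pd.DataFrame({
    "petal_length": np.concatenate([
        rng.normal(1.5, 0.2, 50),
        rng.normal(4.3, 0.5, 50),
        rng.normal(5.5, 0.5, 50),
    ]),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 50),
})

fig, axes = plt.subplots(1, 2)
# The box plot summarizes the pooled values with quartiles only
sns.boxplot(y="petal_length", data=iris, ax=axes[0])
axes[0].set_title("box plot")
# The violin plot adds the density estimate, which shows the separate bumps
sns.violinplot(y="petal_length", data=iris, ax=axes[1])
axes[1].set_title("violin plot")
fig.tight_layout()
```

The box on the left gives no hint that the values fall into separate clusters, while the outline of the violin on the right does.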
Joint plots support both univariate and bivariate analysis. The main plot gives us a bivariate analysis, whereas the plots on the top and right side show the univariate distribution of each of the two variables. This makes our job easier, since we get a scatter plot for the bivariate view and distribution plots for the univariate view in a single figure.
There are a variety of options to choose from, which can be selected using the kind parameter of seaborn's jointplot function.
# Joint plots show bivariate scatterplots
# and univariate histograms
sns.jointplot(x = 'sepal_length', y = 'sepal_width', data = iris)
A strip plot can be drawn on its own, but it is also a good complement to a box or violin plot in cases where you want to show all observations along with some representation of the underlying distribution.
It is a graphical data analysis technique for summarizing a univariate data set. It is typically used for small data sets (histograms and density plots are typically preferred for larger data sets).
# Plotting data as strips
plt.subplot(2,2,1)
sns.stripplot(x = 'species', y = 'sepal_length', data = iris, jitter = True)
plt.subplot(2,2,2)
sns.stripplot(x = 'species', y = 'sepal_width', data = iris, jitter = True)
plt.subplot(2,2,3)
sns.stripplot(x = 'species', y = 'petal_length', data = iris, jitter = True)
plt.subplot(2,2,4)
sns.stripplot(x = 'species', y = 'petal_width', data = iris, jitter = True)
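Since a strip plot works well as a complement to a box plot, here is a minimal sketch of drawing both on the same axes. The styling (white boxes, small black points) is an arbitrary choice for readability, and the data is a synthetic stand-in for iris.csv (made-up values, same column names).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for iris.csv: made-up values, same column names.
# Replace with pd.read_csv("iris.csv") to plot the real data.
rng = np.random.default_rng(0)
iris = pd.DataFrame({
    "sepal_length": rng.normal(5.8, 0.8, 150),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 50),
})

fig, ax = plt.subplots()
# Box plot first, as the summary of each species
sns.boxplot(x="species", y="sepal_length", data=iris, color="white", ax=ax)
# Strip plot on the same axes, so every observation is visible on top
sns.stripplot(x="species", y="sepal_length", data=iris,
              jitter=True, color="black", size=3, ax=ax)
```

Passing the same `ax` to both calls is what places the individual points over the boxes instead of in a new figure.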
Seaborn's lmplot is a 2D scatterplot with an optional overlaid regression line. Logistic regression for binary classification is also supported with lmplot. It is intended as a convenient interface to fit regression models across conditional subsets of a dataset.
The function can draw a scatterplot of two variables, x and y, and then fit the regression model y ~ x and plot the resulting regression line with a 95% confidence interval for that regression.
lmplot() requires the data parameter, and the x and y variables must be specified as strings (column names in data).
# Fit a separate regression for each species: hue colors the points
# by species and col draws each species in its own panel
sns.lmplot(x = 'sepal_length', y = 'sepal_width', data = iris, hue = 'species', col = 'species')
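To make the y ~ x fit concrete, the sketch below computes a straight-line least-squares fit for each species with NumPy's `polyfit`. This is a stand-in that shows what such a fit produces, not a reproduction of lmplot's internals, and it does not compute the confidence interval. The helper name `fit_line` is hypothetical, and the data is synthetic (made-up values with a built-in linear trend).

```python
import numpy as np
import pandas as pd

def fit_line(frame, x, y):
    """Least-squares straight-line fit of y ~ x; returns (slope, intercept)."""
    slope, intercept = np.polyfit(frame[x], frame[y], deg=1)
    return float(slope), float(intercept)

# Synthetic stand-in for iris.csv: sepal_width is built as a noisy
# linear function of sepal_length. Values are made up.
rng = np.random.default_rng(0)
sepal_length = rng.normal(5.8, 0.8, 150)
iris = pd.DataFrame({
    "sepal_length": sepal_length,
    "sepal_width": 0.5 * sepal_length + rng.normal(0.0, 0.3, 150),
    "species": np.repeat(["setosa", "versicolor", "virginica"], 50),
})

# One fit per species, mirroring the per-species panels of the plot
for species, group in iris.groupby("species"):
    slope, intercept = fit_line(group, "sepal_length", "sepal_width")
    print(f"{species}: sepal_width ~ {slope:.2f} * sepal_length + {intercept:.2f}")
```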
So there you go: you have learned about the different kinds of plots that you can make using the seaborn and matplotlib libraries. Data visualization not only helps you understand your data well, but also lets you share any insights you find with other people.
Now go on and try creating such amazing plots on some real-world data sets.