Understanding The Concept of Clustering In Unsupervised Learning by@pramod.pandey83

January 24th 2020 317 reads

**What is hierarchical clustering in unsupervised learning?**

Data in itself has no value, it actually finds its expression when it is processed right, for the right purpose using the right tools.

So when it comes to understanding the data it becomes extremely important that we are not only looking to extract obvious insights but also to identify the hidden patterns which may not be easy to find just by exploratory data analysis. To make intelligent predictions, identifying patterns and make effective recommendations our data need to be segregated into meaningful clusters. This stream of machine learning where we do not rely on a labeled data set which has a target variable already defined instead rely upon clustering the data sets into groups and try to make predictions about the behavior is called unsupervised learning.

Unsupervised learning collaborates with supervised machine learning to make our model robust and reliable. So today we will look into unsupervised learning techniques, we will go into details of

*- What is Unsupervised Learning?*

*- Types Of Unsupervised Learning*

*- Understanding clustering & its types*

*- Hands-on on K-Means & hierarchical clustering*

Let’s start this journey of learning by understanding unsupervised learning.

It’s a learning process where our machine has the unlabeled data set to be processed and it has to find the unknown/unseen pattern in the given data set.

*It’s is a kind of self-organized **Hebbian learning** that helps the machine to find previously unknown patterns in data set without pre-existing labels.*

Here in the pic, you can see that the first robot has got the data set where the required result to be tested & predicted is known to him. This is what we call **supervised learning** with labeled data. Whereas in the second case our robot is speaking about how it doesn’t have any clue about what is an outcome he is trying to achieve with the given data set. Here there is no pre-labeled data set fed to the robot, this is what we call **Unsupervised learning.**

Unlike supervised learning where we have labeled data that needs to be classified or predicted, here we don’t have any target variable which is used to measure the model outcome. Our machine self-learn from the unlabeled data set & and allows modeling probability densities of the given input.

But the question is how does our machine learn in this scenario? Well, here what we do is that we group the given data set where every group has some kind of pattern or common behavior amongst them. Let’s get into the depth of it by understanding what is clustering & types of unsupervised learning.

Two major types of unsupervised learning methodology are:

*1. Principal Component Analysis*

*2. Cluster Analysis(Clustering)*

*Principal Component Analysis (PCA) is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.*

Clustering is a methodology where we group or segment the given datasets where each cluster/group of data has shared attributes to extrapolate algorithmic relationships.

In this machine learning technique, the cluster of the data is not labeled, classified or categorized. The clustering of data is done based on the similarity of the feature.

*Instead of responding to feedback, cluster analysis identifies commonalities in the data and reacts based on the presence or absence of such commonalities in each new piece of data.*

Cluster analysisorclusteringis the task of grouping a set of objects in such a way that objects in the same group (called acluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Clustering divides objects based on unknown features. Machine chooses the best way

Going further, we will mainly look into clustering techniques and associated algorithm with it, then we will look K-Means in detail and go hands-on with one problem statement using python & Jupyter notebook

Nowadays used:

For market segmentation (types of customers, loyalty)To merge close points on a mapFor image compressionTo analyze and label new dataTo detect abnormal behavior

Going further, we will mainly look into clustering techniques and associated algorithms with it, then we will look at K-Means in detail and go hands-on with one problem statement using python & Jupyter notebook.

**For example,** you must have used Apple Photos and Google Photos in your phone, here a more complex level of clustering is involved. Your pics are clubbed based on all the faces in photos to create albums of your friends. The concerned app doesn’t know how many friends you have and how they look, but it’s trying to find the common facial features, it is a typical case of unsupervised learning analysis, which involves clustering.

*Similarly e-Commerce giants like amazon used customer segmentation to offer product recommendations based on group behavior and similarly of the attributes of the consumers in that particular group.*

3 major type of clustering techniques are:

** -**Hierarchical Clustering

-Partitional Clustering

-Bayesian Clustering

The below diagram gives the details of the type of clustering mechanism employed in unsupervised learning.

We will mainly look into two of them in today’s journey.

**Hierarchical Clustering**

**K-Means Clustering**

Hierarchical clustering is an algorithm that groups similar objects into a cluster where each successive cluster is formed based on the previously established cluster. The endpoint is a set of clusters*, *where each cluster is different & distinct from each other, and also the attributes within each cluster are broadly similar to each other.

Strategies for hierarchical clustering generally fall into two types

**Agglomerative**: This is a bottom-up approach: where each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.**Divisive**: This is a top-down approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

These clustering method partitions the objects into k clusters and each partition form one cluster. This method is used to optimize an objective criterion similarity function such as when the distance is a major parameter example *K-means, CLARANS (Clustering Large Applications based upon randomized Search),* etc. This mechanism determines all the cluster at once.

*Partitional clustering decomposes a data set into a set of disjoint clusters. Given a data set of N points, a partitioning method constructs K (N ≥ K) partitions of the data, with each partition representing a cluster. That is, it classifies the data into K groups by satisfying the following requirements: (1) each group contains at least one point, and (2) each point belongs to exactly one group. Notice that for fuzzy partitioning, a point can belong to more than one group.*

One of the popular partitional clustering methods is

*k*-means clustering is a method to partition *n number of *observations into *k* clusters where each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster.

*The outcome of the K-means clustering algorithm is:*

The centroids of the *K* clusters, which can be used to label new dataLabels for the training data (each data point is assigned to a single cluster)

It aims to partition a set of observations into several clusters (k), resulting in the partitioning of the data into Voronoi cells. It can be considered a method of finding out which group a certain object belongs to.

*This algorithm works in the following manner:*

*K points are placed into the object data space representing the initial group of centroids.Each object or data point is assigned to the closest k.After all, objects are assigned, the positions of the k centroids are recalculated.Steps 2 and 3 are repeated until the positions of the centroids no longer move.*

It is a Bottom-up hierarchical clustering algorithm, where we treat each data point as a single cluster to start with and then successively merge (or *agglomerate*) pairs of clusters until all clusters have been merged into a single cluster that contains all data points. While we do this process of merging the cluster based on similarity we end up with a tree called a *dendrogram*

*The root of the tree is the unique cluster that gathers all the samples, the leaves being the clusters with only one sample*

src: Agglomerative Hierarchical Clustering

To understand how the above merging happens we need to understand the following key concepts:

The distance between two given clusters is computed using the distance metrics like Euclidean distance. There are other distance metrics too, like Euclidean squared, Manhattan, etc., but the choice of a distance metric purely depends on the domain and the business problem statement which one is trying to solve.

Once we decide to select the right distance metric, one needs to decide from which point distance is being calculated. We achieve this by using various types of linkage options. H*ow *the pairs merge in Agglomerative clustering, involves calculating a dissimilarity between each merged pair and the other sample, which is achieved by the linkage criteria.

**Single Linkage:**

Here the distance between two clusters is the distance between the nearest neighbors in those clusters.

**Example:** The distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two closest points.

*In single-link clustering, we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.*

**Complete Linkage:**

**Example:** the distance between clusters “r” and “s” to the left is equal to the length of the arrow between their two furthest points

In this type of clustering (it is also called the *diameter* or *maximum* method), we consider the distance between one cluster and another cluster to be equal to the longest distance from any member of one cluster to any member of the other cluster.

*Here the distance between two clusters is the distance between the furthest points in those clusters.*

**Average Linkage:**

In *average-link* clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.

**Example**: the distance between clusters “r” and “s” to the left is equal to the average length each arrow between connecting the points of one cluster to the other.

*Here the distance between two clusters is the average of the distances between all the points in those clusters.*

**Centroid-linkage**: finds the centroid of cluster 1 and centroid of cluster 2, and then calculates the distance between the two before merging.

It is a top-down clustering method and is less commonly used. It works similarly to agglomerative clustering but, in the opposite direction.

*Here, all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.*

**To Summarize: Here**

*Data starts as one single combined cluster.The cluster splits into two distinct parts, based on some degree of similarity.Clusters split into two again and again until the clusters only contain a single data point.*

Divisive Clustering method is rigid i.e., once a merging or splitting is done, it can never be undone

We will make use of Wholesale customer Data sets. This file has been sourced from UCI: https://archive.ics.uci.edu/ml/datasets/wholesale+customers

The data set refers to the clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories

*Note! **We expect that you have installed anaconda from **here**, to use Jupyter as your code editor. we will be making use of Python version 3 which will come pre-loaded in anaconda installation.*

FRESH: annual spending (m.u.) on fresh products (Continuous)

MILK: annual spending (m.u.) on milk products (Continuous)

GROCERY: annual spending (m.u.)on grocery products (Continuous)

FROZEN: annual spending (m.u.)on frozen products (Continuous)

DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)

DELICATESSEN: annual spending (m.u.)on and delicatessen products (Continuous);

CHANNEL: customers Channel — Horeca (1: Hotel/Restaurant/Cafe) or Retail channel (Nominal).

We will try to cluster this given dataset into a customer segment, which is achieved based on the purchase made by various consumers. The intent is to cluster similar customers together using a hierarchical clustering technique

I have created a simple Kaggle notebook to help you get hands-on experience in solving the problem statement and achieve the desired objective. Here is the link:

https://www.kaggle.com/pramodchandrayan/understanding-clustering-in-unsupervised-learning

We will look deeper into another important data clustering technique called **K-Means Clustering **.

Data science works well when the two exponent of machine learning mechanism named supervised and unsupervised works closely as a team. When you want your data to make more sense and to be structured well employ unsupervised learning methodology of data clustering & dimensionality reduction using PCA.Now this data will be more useful to be used in supervised learning algo to help you build effective models.