Understanding The Concept of Clustering In Unsupervised Learning by@pramod.pandey83

Data in itself has no value, it actually finds its expression when it is processed right, for the right purpose using the right tools.

When it comes to understanding data, it is important not only to extract the obvious insights but also to identify hidden patterns that may not surface through exploratory data analysis alone. To make intelligent predictions, identify patterns, and make effective recommendations, our data needs to be segregated into meaningful clusters. The stream of machine learning in which we do not rely on a labeled data set with a predefined target variable, but instead cluster the data into groups and try to make predictions about behavior, is called unsupervised learning.

Unsupervised learning works alongside supervised machine learning to make our models robust and reliable. So today we will look into unsupervised learning techniques and go into the details of clustering.

Let’s start this journey of learning by understanding unsupervised learning.

It’s a learning process where our machine has the unlabeled data set to be processed and it has to find the unknown/unseen pattern in the given data set.

Here in the pic, you can see that the first robot has a data set where the required result to be tested & predicted is known to it. This is what we call **supervised learning** with labeled data. In the second case, the robot has no clue about the outcome it is trying to achieve with the given data set; there is no pre-labeled data fed to it. This is what we call **unsupervised learning**.

Unlike supervised learning, where we have labeled data that needs to be classified or predicted, here we don't have any target variable against which to measure the model's outcome. Our machine self-learns from the unlabeled data set and allows modeling probability densities of the given input.

But the question is: how does our machine learn in this scenario? What we do is group the given data set so that every group shares some kind of pattern or common behavior. Let's get into the depth of it by understanding what clustering is & the types of unsupervised learning.

Two major types of unsupervised learning methodology are clustering and dimensionality reduction.

Clustering is a methodology in which we group or segment the given data set so that each cluster/group shares common attributes, from which we can extrapolate algorithmic relationships.

In this machine learning technique, the clusters of data are not labeled, classified, or categorized beforehand. The clustering is done based on the similarity of features.

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

Clustering divides objects based on unknown features, and the machine chooses the best way to group them.

Going further, we will mainly look into clustering techniques and the algorithms associated with them; then we will look at K-Means in detail and go hands-on with one problem statement using Python & a Jupyter notebook.

Nowadays clustering is used:

-For market segmentation (types of customers, loyalty)

-To merge close points on a map

-For image compression

-To analyze and label new data

-To detect abnormal behavior


3 major types of clustering techniques are:

-Hierarchical Clustering

-Partitional Clustering

-Bayesian Clustering

The below diagram gives the details of the type of clustering mechanism employed in unsupervised learning.

We will mainly look into two of them in today’s journey.

**Hierarchical Clustering**

**K-Means Clustering**

Hierarchical clustering is an algorithm that groups similar objects into a cluster, where each successive cluster is formed based on the previously established clusters. The endpoint is a set of clusters, where each cluster is distinct from the others, and the attributes within each cluster are broadly similar to each other.

Strategies for hierarchical clustering generally fall into two types: agglomerative (bottom-up) and divisive (top-down).

These clustering methods partition the objects into k clusters, where each partition forms one cluster. This method optimizes an objective criterion similarity function, typically one where distance is the major parameter, e.g. *K-means, CLARANS (Clustering Large Applications based upon randomized Search),* etc. This mechanism determines all the clusters at once.

One of the popular partitional clustering methods is K-Means. Its output is:

-The centroids of the *K* clusters, which can be used to label new data

-Labels for the training data (each data point is assigned to a single cluster)

It aims to partition a set of observations into k clusters, resulting in a partitioning of the data space into Voronoi cells. It can be considered a method of finding out which group a given object belongs to.
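To make the assign-and-update loop concrete, here is a minimal pure-Python sketch of K-Means. This is an illustrative toy, not the notebook's code: the first-k initialization and the toy points are assumptions for demonstration (real implementations use random or k-means++ seeding).

```python
import math

def kmeans(points, k, iters=100):
    """Minimal K-Means: returns (centroids, labels)."""
    # naive init: first k points (real code uses random or k-means++ seeding)
    centroids = list(points[:k])
    labels = [0] * len(points)
    for _ in range(iters):
        # assignment step: each point joins the cluster of its nearest centroid
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # update step: move each centroid to the mean of its assigned points
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members)
                                           for dim in zip(*members)))
            else:  # keep an emptied cluster's old centroid
                new_centroids.append(centroids[c])
        if new_centroids == centroids:  # converged: assignments are stable
            break
        centroids = new_centroids
    return centroids, labels

# two well-separated toy blobs
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = kmeans(pts, k=2)
```

After convergence, the three points near the origin share one label and the three points near (5, 5) share the other; the returned `centers` are the two blob means, i.e. the Voronoi sites mentioned above.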

Agglomerative clustering is a bottom-up hierarchical clustering algorithm: we start by treating each data point as a single cluster and then successively merge (or *agglomerate*) pairs of clusters until all of them have been merged into one cluster containing all the data points. As we carry out this process of merging clusters based on similarity, we end up with a tree called a *dendrogram*.

To understand how the above merging happens we need to understand the following key concepts:

The distance between two given clusters is computed using a distance metric such as Euclidean distance. There are other distance metrics too, like squared Euclidean, Manhattan, etc., but the choice of metric depends purely on the domain and the business problem one is trying to solve.
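The metrics named above differ only in how they combine coordinate differences; a quick worked example (the two points are arbitrary):

```python
import math

p, q = (1.0, 2.0), (4.0, 6.0)  # coordinate differences: 3 and 4

euclidean = math.dist(p, q)                        # sqrt(3**2 + 4**2) -> 5.0
squared   = euclidean ** 2                         # squared Euclidean -> 25.0
manhattan = sum(abs(a - b) for a, b in zip(p, q))  # |3| + |4|         -> 7.0
```

Squared Euclidean preserves the same nearest-neighbor ordering as Euclidean while skipping the square root; Manhattan sums absolute differences along each axis.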

Once the right distance metric is selected, one needs to decide from which points the distance is calculated. We achieve this by using various types of linkage options. *How* the pairs merge in agglomerative clustering involves calculating the dissimilarity between each merged pair and the other samples, and this is defined by the linkage criterion.

In *single-link* clustering, the distance between two clusters is the distance between the nearest neighbors in those clusters.

In *complete-link* clustering (also called the *diameter* or *maximum* method), we consider the distance between one cluster and another to be equal to the longest distance from any member of one cluster to any member of the other cluster.

In *average-link* clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
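The three linkage rules above can be sketched together in a few lines of pure Python. This is an illustrative O(n³) toy with made-up points, not the notebook's code or a production implementation:

```python
import itertools
import math

def cluster_distance(a, b, linkage):
    """Distance between two clusters (lists of points) under a linkage rule."""
    dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":      # nearest neighbors across the two clusters
        return min(dists)
    if linkage == "complete":    # farthest members (diameter/maximum method)
        return max(dists)
    return sum(dists) / len(dists)  # average-link

def agglomerative(points, n_clusters, linkage="single"):
    """Bottom-up merging: start from singletons, repeatedly merge the
    closest pair of clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: cluster_distance(clusters[ij[0]],
                                                   clusters[ij[1]], linkage))
        clusters[i].extend(clusters.pop(j))  # merge cluster j into cluster i
    return clusters

pts = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.1, 5.1), (10.0, 0.0)]
groups = agglomerative(pts, n_clusters=3, linkage="single")
```

With these points, the two pairs of near-identical points merge first, leaving the isolated point (10, 0) as its own singleton cluster. Recording the order and height of each merge is exactly what produces the dendrogram.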

Divisive clustering is a top-down method and is less commonly used. It works similarly to agglomerative clustering but in the opposite direction: we start with all points in one cluster and successively split it.

The divisive clustering method is rigid, i.e., once a split is done, it can never be undone.

We will make use of the Wholesale Customers data set, sourced from UCI: https://archive.ics.uci.edu/ml/datasets/wholesale+customers

The data set refers to the clients of a wholesale distributor. It includes the annual spending in monetary units (m.u.) on diverse product categories:

FRESH: annual spending (m.u.) on fresh products (Continuous)

MILK: annual spending (m.u.) on milk products (Continuous)

GROCERY: annual spending (m.u.) on grocery products (Continuous)

FROZEN: annual spending (m.u.) on frozen products (Continuous)

DETERGENTS_PAPER: annual spending (m.u.) on detergents and paper products (Continuous)

DELICATESSEN: annual spending (m.u.) on delicatessen products (Continuous)

CHANNEL: customer's channel — Horeca (1: Hotel/Restaurant/Cafe) or Retail channel (Nominal)

We will try to segment this data set into customer groups based on the purchases made by various consumers. The intent is to cluster similar customers together using a hierarchical clustering technique.
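Before computing distances between customers, spending columns on very different scales are usually standardized so that no single high-spend category dominates the result. A minimal sketch of that preprocessing step; the column values below are made up, and z-score scaling is an assumed (though common) choice, not something mandated by the data set:

```python
import statistics

# toy rows shaped like the wholesale spending columns (values are made up)
columns = {
    "FRESH": [12000.0, 7100.0, 6400.0, 13300.0],
    "MILK":  [9600.0, 9800.0, 8800.0, 1200.0],
}

def standardize(values):
    """z-score scaling: subtract the column mean, divide by its std. dev."""
    mean = statistics.fmean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]

scaled = {name: standardize(vals) for name, vals in columns.items()}
```

After scaling, every column has mean 0 and unit variance, so Euclidean distances between customers weigh each product category comparably before the hierarchical clustering is applied.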

I have created a simple Kaggle notebook to help you get hands-on experience in solving the problem statement and achieve the desired objective. Here is the link:

We will look deeper into another important data clustering technique called **K-Means Clustering**.

Data science works well when the two exponents of machine learning, supervised and unsupervised, work closely as a team. When you want your data to make more sense and be better structured, employ unsupervised learning methodologies such as data clustering & dimensionality reduction using PCA. This data will then be more useful in supervised learning algorithms, helping you build effective models.