In this article, we will explore the concept of clustering in machine learning, the different types of clustering, and the algorithms behind each.
Before delving into the specifics of clustering, let's take a moment to survey the different types of learning in machine learning:
Supervised Learning:
In supervised learning, models are trained on labeled data, where the term "label" corresponds to the category of the input data. For instance, consider a spam classification system. Here, models are trained on both spam and non-spam emails, each tagged with its respective label. Tasks such as classification and regression fall within the purview of supervised learning.
Unsupervised Learning:
Unsupervised learning, in contrast, operates without any labeled data: models are trained on data devoid of explicit labels. Recommendation systems, for example, often rely on unsupervised learning techniques. Clustering, as a task, falls under the umbrella of unsupervised learning.
While other learning types, such as reinforcement learning and semi-supervised learning, exist, they are not the focus of our discussion in this article.
Building on the above explanation, it becomes evident that clustering falls under the domain of unsupervised learning: the data provided to the model during training is unlabeled.
Now that we have a contextual understanding of where the concept of clustering emerges, it's time to delve into the essence of what clustering truly is.
What is Clustering?
In simple terms, clustering is a technique that involves grouping data based on certain characteristics. These groups of data points are referred to as "clusters." To illustrate this concept, let's explore a small example.
Consider the following sentences as a dataset:
Cricket is a fun game.
Cars are cool machines that help us travel from one place to another.
Teams take turns to bat and bowl.
Scoring lots of runs makes you a hero in cricket!
They have wheels, an engine, and can go really fast on roads.
Some cars are big like trucks, and others are small like tiny race cars.
By examining the properties and meanings of these sentences, we can group them into two categories: cricket and cars. This grouping aligns with what a clustering algorithm accomplishes during its training process. It captures the context (meaning) of the input data in numerical (vector) form and clusters the data points (vectors) based on their similarity.
During inference, the clustering algorithm calculates the similarity between a query vector and the formed clusters (established during training) and assigns the query vector to the closest cluster. We will delve into the specifics of similarity metrics later in this article.
How does a clustering algorithm work?
While a brief introduction to how clustering works was provided earlier, let's now look at the process from a procedural perspective.
In the realm of machine learning, various clustering algorithms exist. However, the following procedure is a universal framework applicable to any clustering algorithm in use.
Data Preprocessing:
In the initial phase, text data undergoes preprocessing techniques like tokenization, lemmatization, and stop-word removal. This step enhances the quality of the data for subsequent analysis.
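As a rough illustration, here is a minimal preprocessing sketch using NLTK (one of several suitable toolkits; spaCy would work equally well). It assumes the punkt, stopwords, and wordnet resources can be downloaded:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Fetch the required resources once (no-ops on later runs).
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def preprocess(sentence):
    # Lowercase, tokenize, drop stop words and punctuation, then lemmatize.
    tokens = word_tokenize(sentence.lower())
    return [lemmatizer.lemmatize(t) for t in tokens
            if t.isalpha() and t not in stop_words]

print(preprocess("Scoring lots of runs makes you a hero in cricket!"))
# e.g. ['scoring', 'lot', 'run', 'make', 'hero', 'cricket']
```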
Embedding Generation:
Following preprocessing, we proceed to generate embeddings (vectors) for the processed text data. Various approaches exist for this task, ranging from traditional techniques like bag-of-words to more recent advancements such as transformers and language model pre-training.
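For instance, a minimal sketch with scikit-learn's TfidfVectorizer (a bag-of-words-style technique; a transformer-based sentence encoder could be swapped in the same way) turns the example sentences from earlier into vectors:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

sentences = [
    "Cricket is a fun game.",
    "Cars are cool machines that help us travel from one place to another.",
    "Teams take turns to bat and bowl.",
    "Scoring lots of runs makes you a hero in cricket!",
    "They have wheels, an engine, and can go really fast on roads.",
    "Some cars are big like trucks, and others are small like tiny race cars.",
]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(sentences)  # one TF-IDF vector per sentence
print(X.shape)  # (6, vocabulary_size)
```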
Cluster Formation:
During the training of the clustering algorithm, clusters are formed based on the similarity between vectors. Metrics like cosine similarity or Euclidean distance are employed to calculate the similarity between vectors. It is essential for a robust clustering algorithm that intra-cluster distance is minimized while inter-cluster distance is maximized.
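Continuing the sketch above, k-means (a centroid-based algorithm discussed later in this article) can form the clusters. TF-IDF vectors are L2-normalized by default, so Euclidean distance here behaves much like cosine distance:

```python
from sklearn.cluster import KMeans

# X comes from the embedding sketch above.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(labels)  # e.g. [0 1 0 0 1 1] -- the cricket vs. cars grouping
```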
Assignment of New Data Points:
In the inference phase, the clustering algorithm compares a query vector to each cluster and assigns it to the cluster with the highest similarity score (or, equivalently, the lowest distance). This step ensures that new data points are accurately placed within existing clusters based on their similarity to the cluster centroids.
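Continuing the same sketch, a new sentence is embedded with the already-fitted vectorizer and assigned to its nearest centroid (the query below is made up for illustration):

```python
# vectorizer and kmeans are the fitted objects from the sketches above.
query = vectorizer.transform(["Bowlers try to take wickets in a match."])
print(kmeans.predict(query))  # index of the closest cluster centroid
```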
Types of Clustering
There are different approaches to clustering a group of data points based on the similarity of their vector representations. A few of these techniques are discussed here:
Hierarchical clustering
Centroids-based Clustering
Density-based Clustering
Hierarchical clustering
Hierarchical clustering is also known as Connectivity-based Clustering, where data points are connected to their neighbors based on proximity (distance).
The clusters are represented in a tree-like structure called a Dendrogram.
The X-axis of the dendrogram represents the individual data points, while the Y-axis represents the distance at which clusters merge.
Similar data objects merge at small distances and fall into the same cluster, while dissimilar data objects merge higher up in the hierarchy.
The two most common types of Hierarchical Clustering are:
Agglomerative Clustering:
Also known as the AGNES algorithm, this type of clustering adopts a bottom-up approach.
Initially, all data points are treated as individual clusters, resulting in N clusters.
Through an iterative procedure, these N clusters gradually merge with their most similar neighboring clusters.
By the end of the iterative process, all the individual clusters are amalgamated into a single large cluster containing all N data points; in practice, the dendrogram is cut at the level that yields the desired number of clusters.
Divisive Clustering:
Also known as the DIANA algorithm, divisive clustering takes a top-down approach.
All data points start within a single large cluster.
Through an iterative process, this massive cluster breaks into sub-clusters and continues until a termination condition is met.
This approach contrasts with agglomerative clustering, providing a different perspective on hierarchical clustering methodologies.
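As a minimal sketch of the bottom-up variant, SciPy's linkage function performs agglomerative clustering and can draw the dendrogram described earlier (the 2-D points below are made up for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage

points = np.array([[1, 1], [1.2, 0.8], [5, 5], [5.1, 4.8], [9, 9]])
Z = linkage(points, method="ward")  # ward merges the pair that least increases variance

dendrogram(Z)
plt.xlabel("data points")      # X-axis: the individual observations
plt.ylabel("merge distance")   # Y-axis: distance at which clusters merge
plt.show()
```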
Centroids-based Clustering
This clustering technique, also recognized as Partition-based Clustering, stands out as one of the simplest and most widely utilized methods in the realm of unsupervised learning. The renowned k-means clustering algorithm is a prominent example within this category.
In this approach, the dataset is partitioned into a predetermined number of clusters, with each vector assigned to a specific cluster. The process involves comparing a new vector with all cluster centroids and assigning it to the cluster with the minimum distance. This method offers a straightforward yet effective means of organizing data into distinct clusters.
Common distance metrics used for calculating cluster distances in this type of clustering technique include Euclidean distance, Manhattan distance, or Minkowski distance.
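All three metrics are available in SciPy; a quick sketch on two toy points:

```python
from scipy.spatial import distance

a, b = [0, 0], [3, 4]
print(distance.euclidean(a, b))       # 5.0  (straight-line distance)
print(distance.cityblock(a, b))       # 7    (Manhattan distance)
print(distance.minkowski(a, b, p=3))  # ~4.5 (generalizes both via the order p)
```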
However, a notable limitation of this clustering approach is that the initial number of clusters (k) must be defined beforehand. This requirement can pose a challenge, as determining the optimal value for k may not always be straightforward and may require additional considerations or techniques.
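One common heuristic is the "elbow" method: fit k-means for a range of k values and look for the point where the total within-cluster distance (inertia) stops dropping sharply. A sketch on synthetic data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

inertias = []
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(km.inertia_)  # sum of squared distances to centroids
# Plotting inertias against k, the curve typically bends ("elbows")
# near the true number of clusters -- here, around k = 4.
```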
Density-based Clustering
In contrast to the aforementioned clustering techniques, density-based clustering does not partition the data around centroids or a preset number of clusters; instead, it considers the density of the data points' distribution.
Data points are clustered based on their densities, wherein areas with a high concentration of points are identified as dense clusters, while low-concentration areas are considered separate clusters.
Another notable distinction from the previously discussed techniques is the effective handling of outliers by density-based clustering algorithms. Outliers are explicitly labeled as such, a more flexible approach than centroid-based clustering, which compels every point, outliers included, to be assigned to one of the predefined clusters.
Prominent algorithms in this clustering technique include DBSCAN and OPTICS. These methods showcase the adaptability of density-based clustering in capturing intricate patterns within datasets.
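A minimal DBSCAN sketch on made-up 2-D points shows this outlier handling: points that belong to no dense region receive the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN

points = np.array([
    [1.0, 1.0], [1.1, 0.9], [0.9, 1.1],  # dense region A
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.1],  # dense region B
    [9.0, 1.0],                          # isolated point
])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)  # e.g. [0 0 0 1 1 1 -1] -- the isolated point is an outlier
```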
Applications of Clustering
Let's explore scenarios where clustering becomes a preferred choice over other techniques in machine learning. Below are specific use cases where clustering proves to be highly effective:
Unlabeled Data or High Labeling Costs:
Scenario: When the available data is unlabeled, or the cost associated with labeling the data is prohibitively high.
Significance: Clustering is particularly valuable in scenarios where obtaining labeled data is impractical or costly.
Recommendation and Search Engines:
Scenario: Tasks such as recommendation engines and search engines leverage clustering techniques.
Example: Netflix employs clustering for movie suggestions, demonstrating the practical application of clustering in enhancing user experience and content discovery.
Customer Segmentation:
Scenario: In cases where businesses need to segment customers or markets based on the sales of a particular product or service.
Significance: Clustering aids in identifying distinct customer groups, allowing businesses to tailor their strategies based on specific customer characteristics.
Anomaly Detection in Finance and Banking:
Scenario: Detecting outliers and anomalies in financial data within the banking sector.
Significance: Clustering serves as a robust tool for identifying unusual patterns or transactions that deviate from the norm, contributing to enhanced fraud detection and risk management.
Document Classification and Retrieval:
Scenario: Classifying and retrieving documents based on their content.
Significance: Clustering techniques are instrumental in organizing and categorizing large volumes of documents, streamlining the document retrieval process.
Real-Time Problem Solving:
Scenario: Clustering is applied to various real-time problems, such as grouping live sensor readings or streaming user activity as they arrive.
Significance: The versatility of clustering techniques makes them adaptable to a range of dynamic, real-world challenges, providing effective solutions in diverse contexts.
Conclusion
In conclusion, clustering emerges as a versatile tool, particularly in scenarios involving unlabeled data or where labeling proves cost-prohibitive. Its broad utility extends across various domains, including recommendation systems, customer segmentation, anomaly detection, document classification, and real-time problem-solving. For machine learning practitioners, clustering stands as a valuable asset in their toolkit, offering a powerful means of extracting meaningful insights from complex, unlabeled datasets.