In the ever-evolving realm of machine learning, where data-driven insights shape industries and innovations, there exists a versatile technique that enables us to uncover patterns, groupings, and relationships hidden within complex datasets. This technique, known as clustering, stands as a cornerstone in the foundation of machine learning's prowess. Clustering empowers computers to unravel intricate patterns within data points, allowing us to make informed decisions, uncover valuable insights, and even untangle the mysteries of the universe on a digital canvas.
Clustering is an unsupervised learning technique in machine learning and data mining that groups data points based on their inherent similarities or patterns. The goal of clustering algorithms is to identify natural groupings without prior knowledge of any labels. By analyzing the characteristics and features of data, clustering algorithms reveal hidden connections. This process provides valuable insights for decision-making, pattern recognition, and various applications like customer segmentation, anomaly detection, and image recognition. Clustering is a journey of exploration, leading to a deeper understanding of the information at hand.
There are two main types of clustering algorithms:
Partitional clustering seeks to partition data points into distinct, non-overlapping clusters. Every point is allocated to a single cluster, and the goal is to optimize a specific criterion, such as minimizing the sum of squared distances within clusters. k-means clustering is a widely used partitioning algorithm that divides data into k distinct clusters, assigning each point to the cluster whose centroid is nearest.
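As a quick illustration, here is a minimal k-means sketch using scikit-learn. The toy points and the choice of `n_clusters=2` are assumptions made for this example only.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points (assumed for illustration)
X = np.array([[1, 1], [2, 2], [1, 2],
              [8, 8], [9, 8], [8, 9]])

# Partition the points into k = 2 non-overlapping clusters
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # coordinates of the two centroids
```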
Density-based clustering (under partitional clustering) is a technique in machine learning that identifies clusters based on the density of data points. Density-based algorithms uncover clusters of varying shapes and sizes. They work by grouping together data points that are densely packed while separating sparser regions. One of the most notable algorithms in this category is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
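For completeness, below is a minimal density-based sketch using scikit-learn's DBSCAN; the sample points and the `eps` and `min_samples` values are assumptions chosen for this toy data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (assumed toy data)
X = np.array([[1, 1], [1.2, 1.1], [0.9, 1.3],
              [5, 5], [5.1, 4.9], [4.8, 5.2],
              [10, 10]])  # likely flagged as noise

# Points within eps of each other (with at least min_samples neighbors) form a cluster
db = DBSCAN(eps=0.7, min_samples=2).fit(X)

print(db.labels_)  # -1 marks points treated as noise
```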
Hierarchical clustering creates a hierarchical structure of clusters. Hierarchical clustering can be performed in two ways: agglomerative or divisive. In agglomerative clustering, each data point is initially treated as a separate cluster, and clusters are iteratively merged based on proximity until a stopping point is reached. On the other hand, divisive clustering initiates with all data points within a single cluster and proceeds to iteratively divide clusters until a predefined stopping condition is satisfied. The result is typically visualized as a dendrogram, a tree-like diagram that records the order in which clusters are merged or split.
In this blog, our focus will be mostly on agglomerative clustering within the hierarchical clustering domain.
Hierarchical clustering can further be classified into two main types: agglomerative clustering and divisive clustering.
Agglomerative clustering, referred to as bottom-up clustering, begins by considering each data point as an individual cluster. It then proceeds to iteratively merge the closest clusters until a specified stopping criterion is reached. During each iteration, the two closest clusters are combined into a larger cluster, leading to a hierarchical structure of clusters. Eventually, all data points belong to a single cluster, and the clustering outcome can be visualized as a dendrogram.
Divisive clustering, also known as top-down clustering, takes the opposite approach of agglomerative clustering. The process commences with all data points grouped into one cluster and subsequently proceeds to recursively divide the clusters into smaller ones until a specified stopping criterion is achieved. In each iteration, the algorithm selects a cluster and divides it into two or more smaller clusters based on a specified criterion. This process continues recursively until each data point is assigned to its own individual cluster.
As discussed previously, agglomerative clustering initiates with each data point considered as an individual cluster and gradually merges the closest clusters iteratively until all data points belong to a single cluster. It forms a hierarchy of clusters, which can be visualized as a dendrogram.
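The hierarchy produced by agglomerative clustering can be built and visualized with SciPy, as in the short sketch below; the sample points are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Assumed toy 2-D points
X = np.array([[1, 1], [2, 3], [8, 8], [9, 7], [4, 5]])

# Build the merge hierarchy (single linkage here; other strategies are discussed below)
Z = linkage(X, method="single", metric="euclidean")

# Visualize the hierarchy as a dendrogram
dendrogram(Z, labels=["P1", "P2", "P3", "P4", "P5"])
plt.show()
```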
There are several types of strategies within agglomerative clustering:
Single linkage: This strategy measures the distance between two clusters by considering the minimum distance between any pair of points within them. It tends to create long, elongated clusters and is sensitive to noise and outlier data points.
Complete linkage: This strategy measures the distance between two clusters based on the greatest distance between any pair of points within the clusters. It focuses on the farthest points, potentially resulting in compact, spherical clusters.
Average linkage: This strategy calculates the mean distance between all pairs of points in two clusters. It aims to balance the effect of outliers and tends to produce clusters of relatively equal size.
Ward’s algorithm: This strategy aims to minimize the increase in the sum of squared distances within clusters: at each step, it merges the pair of clusters whose merger produces the smallest increase in within-cluster variance.
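To make the strategies concrete, the sketch below runs SciPy's `linkage` with each method on the same assumed toy data; only the `method` argument changes between runs.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Assumed toy data for comparing linkage strategies
X = np.random.RandomState(0).rand(10, 2)

for method in ["single", "complete", "average", "ward"]:
    Z = linkage(X, method=method, metric="euclidean")
    labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
    print(method, labels)
```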
The following table discusses the main differences, advantages, and disadvantages of these different strategies:

| | Single Linkage | Complete Linkage | Average Linkage | Ward’s Algorithm |
|---|---|---|---|---|
| Differences | Uses the minimum distance between any pair of points in the two clusters | Uses the maximum distance between any pair of points in the two clusters | Uses the average distance over all pairs of points in the two clusters | Merges the pair of clusters that gives the smallest increase in within-cluster variance |
| Advantages | Computationally efficient; can detect elongated and irregularly shaped clusters | Produces compact clusters of roughly equal diameter; avoids the chaining effect | Balances the extremes of single and complete linkage; tends to produce clusters of relatively equal size | Produces compact, well-separated clusters by keeping within-cluster variance low |
| Disadvantages | Sensitive to noise and outliers; prone to chaining long clusters together | Sensitive to outliers because it focuses on the farthest points; can break up large clusters | Can still be influenced by outliers, though less so; favors globular clusters | Biased toward spherical, similarly sized clusters; the standard formulation assumes Euclidean distances |
The main differences in calculation between the different strategies (i.e., single, complete, and average) in agglomerative clustering lie in measuring the minimum distance, maximum distance, and average distance between points in different clusters, respectively. Below, we will see the single linkage method algorithm in more detail with an example.
The single linkage algorithm starts with individual data points as clusters and iteratively merges the closest clusters using the smallest distance between any pair of points within the clusters, updating the distance matrix at each step until the desired clustering structure is obtained. The steps are as follows:
1. Start with each data point as an individual cluster.
2. Calculate the distance matrix, which represents the pairwise distances between each pair of clusters based on the minimum distance between any two points in the clusters.
3. Determine the pair of clusters exhibiting the smallest distance in the distance matrix.
4. Merge the two closest clusters into a new cluster.
5. Update the distance matrix by recalculating the distances between the new cluster and the remaining clusters using the minimum distance criterion.
6. Repeat steps 2–5 until all data points belong to a single cluster or until a desired number of clusters is achieved.
The final result is a dendrogram or a set of clusters obtained based on the merging order.
Note: The stopping criterion for the algorithm can vary, such as reaching a predetermined number of clusters or a threshold distance value.
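The steps above can also be translated into a short, from-scratch sketch. This is an illustrative implementation rather than optimized code; the function name, helper, and the default stopping rule (merge until one cluster remains) are choices made here.

```python
import numpy as np

def single_linkage(points, target_clusters=1):
    """Naive agglomerative clustering with the single (minimum) linkage rule."""
    points = np.asarray(points, dtype=float)
    clusters = [[i] for i in range(len(points))]  # step 1: one cluster per point

    def cluster_distance(c1, c2):
        # Single linkage: minimum pairwise distance between the two clusters
        return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

    while len(clusters) > target_clusters:
        # Steps 2-3: find the closest pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        # Step 4: merge them; step 5 (updating distances) happens implicitly on the next pass
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]

    return clusters
```

For example, `single_linkage([[1, 1], [2, 2], [6, 3]], target_clusters=2)` returns `[[0, 1], [2]]`, grouping the two nearby points and leaving the distant one alone.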
Let’s walk through an example of the single linkage method step by step. Suppose we have five data points, A through E, in a two-dimensional space, with the pairwise Euclidean distances shown in the matrix below.
Step 1: Start by considering each data point as an individual cluster. Clusters: {A}, {B}, {C}, {D}, {E}
Step 2: Calculate the distance between every cluster pair using the single linkage method, which is based on the shortest distance between any two points within the clusters.
Note: In this example, we are using the Euclidean distance formula. We can use other distance calculations as well.
Recall the Euclidean distance formula:

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Here, $x_i$ and $y_i$ are the $i$th parameters of the $x$ and $y$ data instances, and $n$ is the number of features in each instance.
| | A | B | C | D | E |
|---|---|---|---|---|---|
| A | 0 | 1.414 | 4.243 | 5 | 5.385 |
| B | | 0 | 4.472 | 3.612 | 4.123 |
| C | | | 0 | 6.08 | 5.385 |
| D | | | | 0 | 1.414 |
| E | | | | | 0 |
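The original coordinates are not listed here, but one set of assumed points that reproduces this distance matrix almost exactly (up to rounding) is A = (1, 1), B = (2, 2), C = (4, −2), D = (5, 4), and E = (6, 3). The sketch below computes the pairwise Euclidean distances for these assumed coordinates.

```python
import numpy as np

# Assumed coordinates chosen to approximately match the distance matrix above
points = {"A": (1, 1), "B": (2, 2), "C": (4, -2), "D": (5, 4), "E": (6, 3)}

labels = list(points)
for i, p in enumerate(labels):
    for q in labels[i + 1:]:
        d = np.linalg.norm(np.array(points[p]) - np.array(points[q]))
        print(f"d({p}, {q}) = {d:.3f}")
```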
Step 3: Determine the pair of clusters with the smallest distance between them and combine them to form a new cluster. From the above table, we see that the smallest distance is 1.414, which occurs for both the pair {A} and {B} and the pair {D} and {E}.

Merge {A} and {B} into {A, B}.

Merge {D} and {E} into {D, E}.

Updated clusters: {A, B}, {C}, {D, E}
Step 4: Recalculate the distances between the new clusters {A, B} and {D, E} and the remaining cluster {C} using the minimum distance criterion:

Distance between {A, B} and {C}: min(4.243, 4.472) = 4.243

Distance between {A, B} and {D, E}: min(5, 5.385, 3.612, 4.123) = 3.612

Distance between {C} and {D, E}: min(6.08, 5.385) = 5.385
Step 5: Identify the pair of clusters with the minimum distance and combine them to form a new cluster. The smallest distance is now 3.612, between {A, B} and {D, E}:

Merge {A, B} and {D, E} into {A, B, D, E}.

Updated clusters: {A, B, D, E}, {C}
Step 6: Recalculate the distance between the new cluster {A, B, D, E} and the remaining cluster {C}:

Distance between {A, B, D, E} and {C}: min(4.243, 4.472, 6.08, 5.385) = 4.243
Step 7: Identify the cluster pair with the smallest distance and amalgamate them into a new cluster:

Merge {A, B, D, E} and {C} into {A, B, C, D, E}.

Final cluster: {A, B, C, D, E}
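With SciPy, the same merge sequence can be obtained directly from the distance matrix, so the hand computation above can be cross-checked. The condensed distance values below are simply the upper triangle of the table from Step 2.

```python
from scipy.cluster.hierarchy import linkage

# Upper triangle of the distance matrix above, in row order:
# (A,B), (A,C), (A,D), (A,E), (B,C), (B,D), (B,E), (C,D), (C,E), (D,E)
condensed = [1.414, 4.243, 5, 5.385, 4.472, 3.612, 4.123, 6.08, 5.385, 1.414]

# Each row of Z records one merge: the two cluster indices, the merge distance,
# and the size of the newly formed cluster
Z = linkage(condensed, method="single")
print(Z)
```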
Note that the other methods, namely the complete linkage and average linkage methods, essentially follow the same steps. The difference lies only in how the distances are recalculated. For example, in the complete linkage method, the maximum distance between points in the two clusters is used instead of the minimum distance.
Let’s have a quick look at Ward’s algorithm.
Ward’s algorithm falls under the category of “centroid-based” linkage methods. It is related to the average linkage method.
In Ward’s algorithm, at every iteration, the two clusters whose merger yields the smallest increase in Ward’s criterion (the sum of squared distances within clusters) are combined into a new cluster. Ward’s criterion measures the increase in variance when two clusters are merged, and the algorithm aims to keep this increase as small as possible at each step.
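For readers who want the formula, one common way to write the merge cost that Ward's method minimizes (assuming Euclidean distances) is the following, where $n_A$ and $n_B$ are the sizes of clusters $A$ and $B$ and $c_A$ and $c_B$ are their centroids; the exact notation varies across textbooks:

$$\Delta(A, B) = \frac{n_A \, n_B}{n_A + n_B} \, \lVert c_A - c_B \rVert^2$$

Merging the pair with the smallest $\Delta(A, B)$ is equivalent to choosing the merge that increases the total within-cluster sum of squares the least.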
Suppose we have three data points (A, B, and C) in a one-dimensional space.

Step 1: Start with each data point as an individual cluster: {A}, {B}, {C}
Step 2: Calculate the distance for every combination of cluster pairs: between {A} and {B}, between {A} and {C}, and between {B} and {C}.
Step 3: Merge the two clusters that lead to the smallest increase in Ward’s criterion. Since the minimum distance is 3, the corresponding pair of points is merged into a single cluster, leaving the third point as its own cluster, so two clusters remain: the newly merged pair and the remaining singleton.
Step 4: Recalculate the distance between the new cluster and the remaining cluster using Ward’s criterion, and merge them, since they are the only two clusters left.

Final cluster: {A, B, C}
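Since the original values for A, B, and C are not shown here, the sketch below uses assumed one-dimensional values (A = 1, B = 4, C = 9, chosen so that the smallest gap is 3) and runs Ward's method with SciPy to reproduce this two-step merge order.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Assumed 1-D data: A = 1, B = 4, C = 9 (smallest pairwise distance is 3, between A and B)
X = np.array([[1.0], [4.0], [9.0]])

# Ward's method: each merge minimizes the increase in within-cluster variance
Z = linkage(X, method="ward")

# Row 1: A and B merge first; row 2: the cluster {A, B} then merges with C
print(Z)
```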
Hierarchical clustering serves as a powerful means of organizing and structuring datasets by grouping similar data points together based on the resemblance of their characteristics. Among the various strategies, the single linkage method defines the distance between two clusters as the minimum distance between any two points, one from each cluster, and this rule shapes how distinct clusters emerge within a dataset.
The single linkage method has the advantage of being computationally efficient and capable of detecting elongated and irregularly shaped clusters. However, it is sensitive to noise and outliers, as it tends to create long, chaining clusters. This sensitivity can lead to the formation of false connections between distant points.
Ward’s algorithm is a powerful technique used in data analysis and machine learning to group similar data points together based on their distance or dissimilarity. The algorithm aims to minimize the variance within clusters as it merges data points into clusters, leading to a hierarchical structure of clustered data.
For a much deeper dive into clustering and machine learning, we recommend exploring the following courses:
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.
Mastering Machine Learning Theory and Practice
The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.
Free Resources