
Hierarchical Clustering

13 min read
Oct 20, 2023
Contents
What is Clustering?
Hierarchical Clustering
Agglomerative Clustering
Single Linkage Algorithm
Example
Ward's Algorithm
Example
Conclusion and Next Steps

In the ever-evolving realm of machine learning, where data-driven insights shape industries and innovations, there is a versatile technique for uncovering the patterns, groupings, and relationships hidden within complex datasets. This technique, known as clustering, is a cornerstone of machine learning. Clustering enables computers to find structure in data points, helping us make informed decisions and uncover valuable insights.

What is Clustering?#

Clustering is an unsupervised learning technique in machine learning and data mining that groups similar data points based on their inherent similarities or patterns. The goal of clustering algorithms is to identify natural clusters without prior knowledge. By analyzing the characteristics and features of data, clustering algorithms reveal hidden connections. This process provides valuable insights for decision-making, pattern recognition, and various applications like customer segmentation, anomaly detection, and image recognition. Clustering is a journey of exploration, leading to a deeper understanding of the information at hand.

There are two main types of clustering algorithms:

Partitional clustering seeks to partition data points into distinct, non-overlapping clusters. Every point is allocated to a single cluster, and the goal is to optimize a specific criterion, such as minimizing the sum of squared distances within clusters. k-means clustering is a widely used partitional algorithm that divides data into k clusters by considering the proximity of data points to the cluster centroids. It iteratively assigns data points to the nearest centroid and updates the centroid positions until the process converges.
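
For a concrete picture of the k-means loop described above, here is a minimal sketch using scikit-learn. The toy array X, the choice of two clusters, and the random seed are illustrative assumptions, not part of any example from this post.

```python
# A minimal k-means sketch with scikit-learn (toy data assumed for illustration).
import numpy as np
from sklearn.cluster import KMeans

# Toy 2-D points forming two loose groups.
X = np.array([[1, 2], [1, 4], [2, 3],
              [8, 8], [9, 9], [8, 10]])

# Fit k-means with k=2; n_init controls how many random restarts are tried.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # final centroid positions
```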

Density-based clustering (a form of partitional clustering) is a technique in machine learning that identifies clusters based on the density of data points. Density-based algorithms uncover clusters of varying shapes and sizes. They work by grouping together data points that are densely packed while separating sparser regions. One of the most notable algorithms in this category is DBSCAN (Density-Based Spatial Clustering of Applications with Noise), which effectively detects clusters in noisy and complex datasets, making it a valuable tool for uncovering hidden patterns in real-world data.
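
A similar sketch for DBSCAN is shown below; the eps and min_samples values and the toy points are assumptions chosen only to illustrate how densely packed points are grouped and sparse points are flagged as noise.

```python
# A minimal DBSCAN sketch with scikit-learn (eps and min_samples are illustrative).
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3],
              [8, 7], [8, 8], [25, 80]])  # the last point is an outlier

# Points within eps of each other (with at least min_samples neighbors) form dense regions.
labels = DBSCAN(eps=3, min_samples=2).fit_predict(X)

print(labels)  # noise points are labeled -1
```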

Types of clustering algorithms

Hierarchical clustering creates a hierarchical structure of clusters. It can be performed in two ways: agglomerative or divisive. In agglomerative clustering, each data point is initially treated as a separate cluster, and clusters are iteratively merged based on proximity until a stopping point is reached. On the other hand, divisive clustering starts with all data points within a single cluster and proceeds to iteratively divide clusters until a predefined stopping condition is satisfied. The result is typically visualized as a dendrogram (a diagram used to visualize hierarchical relationships), which shows the merging or splitting process at different levels of granularity.
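
As a rough illustration of how such a dendrogram is produced in practice, the following sketch uses SciPy's hierarchy module on assumed toy points; the coordinates and labels are not from the original example.

```python
# A minimal sketch: build a hierarchy and plot it as a dendrogram with SciPy.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.array([[1, 2], [2, 2], [5, 8], [6, 8], [9, 1]])  # toy points

# 'average' linkage merges clusters by the mean pairwise point distance.
Z = linkage(X, method='average')

dendrogram(Z, labels=['A', 'B', 'C', 'D', 'E'])
plt.title('Dendrogram (average linkage)')
plt.show()
```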

In this blog, our focus will be mostly on agglomerative clustering within the hierarchical clustering domain.

Hierarchical Clustering#

Hierarchical clustering can further be classified into two main types: agglomerative clustering and divisive clustering.

Agglomerative clustering, also referred to as bottom-up clustering, begins by considering each data point as an individual cluster. It then proceeds to iteratively merge the closest clusters until a specified stopping criterion is reached. During each iteration, the two closest clusters are combined into a larger cluster, leading to a hierarchical structure of clusters. Eventually, all data points belong to a single cluster, and the clustering outcome can be visualized as a dendrogram.

Agglomerative vs. divisive clustering

Divisive clustering, also known as top-down clustering, takes the opposite approach of agglomerative clustering. The process commences with all data points grouped into one cluster and subsequently proceeds to recursively divide the clusters into smaller ones until a specified stopping criterion is achieved. In each iteration, the algorithm selects a cluster and divides it into two or more smaller clusters based on a specified criterion. This process continues recursively until each data point is assigned to its own individual cluster.

Agglomerative Clustering#

As discussed previously, agglomerative clustering initiates with each data point considered as an individual cluster and gradually merges the closest clusters iteratively until all data points belong to a single cluster. It forms a hierarchy of clusters, which can be visualized as a dendrogram.

There are several types of strategies within agglomerative clustering:

  • Single linkage: This strategy measures the distance between two clusters by considering the minimum distance between any pair of points within them. It tends to create long, elongated clusters and is sensitive to noise and outlier data points.

  • Complete linkage: This strategy measures the distance between two clusters based on the greatest distance between any pair of points within the clusters. It focuses on the farthest points, potentially resulting in compact, spherical clusters.

  • Average linkage: This strategy calculates the mean distance between all pairs of points in two clusters. It aims to balance the effect of outliers and tends to produce clusters of relatively equal size.

  • Ward's algorithm: This strategy aims to minimize the increase in the sum of squared distances when merging clusters, merging the closest clusters based on the within-cluster variance.
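
As a quick sketch of how these strategies are selected in practice, the snippet below runs scikit-learn's AgglomerativeClustering with each linkage option on assumed toy data; the coordinates and the choice of two clusters are illustrative only.

```python
# Comparing linkage strategies with scikit-learn (toy data assumed).
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 2], [2, 2], [5, 8], [6, 8], [9, 1]])  # toy points

for strategy in ['single', 'complete', 'average', 'ward']:
    labels = AgglomerativeClustering(n_clusters=2, linkage=strategy).fit_predict(X)
    print(strategy, labels)
```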

The following table discusses the main differences, advantages, and disadvantages of these different strategies:

| | Single Linkage | Complete Linkage | Average Linkage | Ward's Algorithm |
| --- | --- | --- | --- | --- |
| Differences | Measures the distance between clusters based on the minimum distance between any two points in the clusters; tends to create elongated clusters. | Measures the distance between clusters based on the maximum distance between any two points in the clusters; tends to create compact clusters. | Measures the average distance between all pairs of points across the two clusters; balances the effect of outliers. | Minimizes the increase in within-cluster variance when merging clusters. |
| Advantages | Can handle arbitrary shapes and sizes of clusters. | Can produce compact clusters. | Balances outliers and produces relatively equal-sized clusters. | Produces compact and balanced clusters. |
| Disadvantages | Sensitive to noise and outliers; can create elongated, chain-like clusters. | Tends to create tight, spherical clusters. | Might create large clusters due to an averaging effect. | Sensitive to noise and outliers. |

The main difference in calculation between the single, complete, and average linkage strategies lies in whether the minimum, maximum, or average distance between points in different clusters is used, respectively. Below, we will look at the single linkage algorithm in more detail with an example.

Single Linkage Algorithm#

The single linkage algorithm starts with individual data points as clusters and iteratively merges the closest clusters using the smallest distance between any pair of points within the clusters, updating the distance matrix at each step until the desired clustering structure is obtained. The steps are as follows:

  1. Start with each data point as an individual cluster.

  2. Calculate the distance matrix, which represents the pairwise distances between each pair of clusters based on the minimum distance between any two points in the clusters.

  3. Determine the pair of clusters exhibiting the smallest distance in the distance matrix.

  4. Merge the two closest clusters into a new cluster.

  5. Update the distance matrix by recalculating the distances between the new cluster and the remaining clusters using the minimum distance criterion.

  6. Repeat steps 2–5 until all data points belong to a single cluster or until a desired number of clusters is achieved.

  7. The final result is a dendrogram or a set of clusters obtained based on the merging order.

Note: The stopping criterion for the algorithm can vary, such as reaching a predetermined number of clusters or a threshold distance value.
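
The steps above can also be sketched directly in code. The following deliberately unoptimized implementation is an illustrative sketch under assumed naming (the function single_linkage is hypothetical, not code from this post): it keeps a list of clusters and repeatedly merges the pair with the smallest single-linkage distance.

```python
# A naive single-linkage sketch following the steps above (illustrative only).
import numpy as np

def single_linkage(points, n_clusters=1):
    points = np.asarray(points, dtype=float)
    # Step 1: every point starts as its own cluster (stored as lists of point indices).
    clusters = [[i] for i in range(len(points))]

    # Single-linkage distance: the minimum distance between any pair of points
    # drawn from the two clusters (steps 2 and 5).
    def cluster_dist(c1, c2):
        return min(np.linalg.norm(points[i] - points[j]) for i in c1 for j in c2)

    # Steps 3-6: repeatedly find and merge the closest pair of clusters.
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = cluster_dist(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # Step 4: merge the two clusters.
        del clusters[b]                          # b > a, so this index is still valid.
    return clusters

# Example usage with the five points A-E from the walkthrough below.
print(single_linkage([[2, 5], [3, 4], [5, 8], [6, 2], [7, 3]], n_clusters=2))
```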

Example#

Let’s walk through an example of the single linkage method step by step. Suppose we have the following five data points in a two-dimensional space: A(2, 5), B(3, 4), C(5, 8), D(6, 2), and E(7, 3).

  • Step 1: Start by considering each data point as an individual cluster. Clusters: [A], [B], [C], [D], [E]

Initial clusters
  • Step 2: Calculate the distance between every cluster pair using the single linkage method, which is based on the shortest distance between any two points within the clusters.

Note: In this example, we are using the Euclidean distance formula. We can use other distance calculations as well.

Recall the Euclidean distance formula:

dist(x, y) = sqrt((x_1 - y_1)^2 + (x_2 - y_2)^2 + ... + (x_n - y_n)^2)

Here, x_i and y_i are the ith features of the data instances x and y, and n is the number of features in each instance.
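
The distance matrix below can be reproduced with a short SciPy snippet; this is only a convenience check that assumes the coordinates listed above.

```python
# Reproducing the pairwise Euclidean distance matrix for the five example points.
import numpy as np
from scipy.spatial.distance import pdist, squareform

points = np.array([[2, 5], [3, 4], [5, 8], [6, 2], [7, 3]])  # A, B, C, D, E

# pdist returns the condensed pairwise distances; squareform expands them into a matrix.
D = squareform(pdist(points, metric='euclidean'))
print(np.round(D, 3))
```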

| | A | B | C | D | E |
| --- | --- | --- | --- | --- | --- |
| A | 0 | 1.414 | 4.243 | 5 | 5.385 |
| B | | 0 | 4.472 | 3.606 | 4.123 |
| C | | | 0 | 6.083 | 5.385 |
| D | | | | 0 | 1.414 |
| E | | | | | 0 |

  • Step 3: Determine the pair of clusters with the smallest distance between them and combine them together to form a new cluster. From the above table, we see that the distance between points A and B is 1.414, which is also the distance between points D and E. Therefore, we can do the following:

    • Merge [A] and [B] into [A, B]

    • Merge [D] and [E] into [D, E]

    • Updated clusters: [A, B], [C], [D, E]

Updated clusters after first iteration
  • Step 4: Recalculate the pairwise distances between the updated clusters [A, B], [C], and [D, E]:

    • Distance between [A, B] and [C] = min(dist(A, C), dist(B, C)) = min(sqrt((5 - 2)^2 + (8 - 5)^2), sqrt((5 - 3)^2 + (8 - 4)^2)) = 4.243

    • Distance between [A, B] and [D, E] = minimum distance between any point in [A, B] and any point in [D, E] = min(dist(A, D), dist(A, E), dist(B, D), dist(B, E)) = min(sqrt((6 - 2)^2 + (2 - 5)^2), sqrt((7 - 2)^2 + (3 - 5)^2), sqrt((6 - 3)^2 + (2 - 4)^2), sqrt((7 - 3)^2 + (3 - 4)^2)) = 3.606

    • Distance between [C] and [D, E] = min(dist(C, D), dist(C, E)) = min(sqrt((5 - 6)^2 + (8 - 2)^2), sqrt((5 - 7)^2 + (8 - 3)^2)) = 5.385

  • Step 5: Identify the pair of clusters with the minimum distance and combine them to form a new cluster:

    • Merge [A, B] and [D, E] into [[A, B], [D, E]]

    • Updated clusters: [[A, B], [D, E]], [C]

Updated clusters after the second iteration
  • Step 6: Recalculate the distance between the new cluster [[A, B], [D, E]] and [C]:

    • Distance between [[A, B], [D, E]] and [C] = minimum distance between any point in [[A, B], [D, E]] and any point in [C] = 4.243

  • Step 7: Identify the cluster pair with the smallest distance and merge them into a new cluster:

    • Merge [[A, B], [D, E]] and [C] into [[[A, B], [D, E]], [C]]

    • Final cluster: [[[A, B], [D, E]], [C]]

Final cluster

Note that the other methods, namely the complete linkage and average linkage methods, essentially follow the same steps. The only difference is how the distances between clusters are recalculated. For example, the complete linkage method uses the maximum distance between points instead of the minimum.
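
If you want to verify the walkthrough, a sketch like the following (using SciPy's linkage function on the coordinates A through E listed above) reproduces the same merge order.

```python
# Reproducing the single-linkage merge order for the example with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[2, 5], [3, 4], [5, 8], [6, 2], [7, 3]])  # A, B, C, D, E

# Each row of Z records one merge: the indices of the two merged clusters,
# the distance at which they merged, and the size of the new cluster.
Z = linkage(points, method='single')
print(np.round(Z, 3))

# Switching method to 'complete', 'average', or 'ward' changes only how the
# inter-cluster distances are recalculated after each merge.
```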

Let’s have a quick look at Ward’s algorithm.

Ward's Algorithm#

Ward’s algorithm falls under the category of “centroid-based” linkage methods. It is related to the average linkage method.

In Ward’s algorithm, at every iteration, the two clusters whose merger causes the smallest increase in Ward’s criterion (the within-cluster sum of squared distances) are merged into a new cluster. Ward’s criterion measures the increase in variance when two clusters are merged, and the algorithm aims to minimize this increase at each step.
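
To make the criterion concrete, here is a small sketch of the standard Ward merge cost, the increase in the within-cluster sum of squared distances caused by merging two clusters. The function name ward_increase is hypothetical and the formula is the textbook expression rather than code from this post; the sample values reference the one-dimensional example that follows.

```python
# A sketch of Ward's merge cost: the increase in within-cluster variance (SSE)
# caused by merging two clusters (standard formula, shown for illustration).
import numpy as np

def ward_increase(cluster_a, cluster_b):
    a = np.asarray(cluster_a, dtype=float)
    b = np.asarray(cluster_b, dtype=float)
    n_a, n_b = len(a), len(b)
    centroid_a, centroid_b = a.mean(axis=0), b.mean(axis=0)
    # Increase in the sum of squared distances caused by the merge.
    return (n_a * n_b) / (n_a + n_b) * float(np.sum((centroid_a - centroid_b) ** 2))

# With the one-dimensional points from the example below: A(3), B(7), C(10).
print(ward_increase([[3]], [[7]]))   # cost of merging A and B -> 8.0
print(ward_increase([[7]], [[10]]))  # cost of merging B and C -> 4.5 (the smallest)
```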

Example#

Suppose we have the following dataset with three data points (A, B, and C) in a one-dimensional space:

A(3), B(7), C(10)

  • Step 1: Start with each data point as an individual cluster: [A], [B], [C]

Initial clusters
  • Step 2: Calculate the distance for every combination of cluster pairs:

    • Distance between [A] and [B] = |3 - 7| = 4

    • Distance between [A] and [C] = |3 - 10| = 7

    • Distance between [B] and [C] = |7 - 10| = 3

  • Step 3: Merge the two clusters that lead to the smallest increase in Ward’s criterion. Since the minimum distance is 3, between [B] and [C], we merge them into a new cluster [B, C].

    • Updated clusters: [A], [B, C]

Updated clusters
  • Step 4: Recalculate the distance between the new cluster and the remaining cluster(s). The centroid of [B, C] is (7 + 10) / 2 = 8.5:

    • Distance between [A] and [B, C] = |3 - 8.5| = 5.5

    • Final cluster: [A, [B, C]]

Final clusters using Ward’s algorithm
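
As a quick check, a sketch like the following (assuming SciPy and the three one-dimensional points above) reproduces the same merge order.

```python
# Reproducing the Ward's-linkage merge order for the 1-D example with SciPy.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[3.0], [7.0], [10.0]])  # A, B, C

# The first merge row pairs B and C (indices 1 and 2); the second merges A
# with the newly formed [B, C] cluster.
Z = linkage(points, method='ward')
print(np.round(Z, 3))
```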

Conclusion and Next Steps#

Hierarchical clustering is a powerful way to organize and structure a dataset by grouping similar data points together based on their characteristics. Among the various linkage strategies, the single linkage method determines inter-cluster distances by pinpointing the minimum distance between any two points in the respective clusters, which shapes how distinct clusters are delineated within a dataset.

The single linkage method has the advantage of being computationally efficient and capable of detecting elongated and irregularly shaped clusters. However, it is sensitive to noise and outliers, as it tends to create long, chaining clusters. This sensitivity can lead to the formation of false connections between distant points.

Ward’s algorithm is a powerful technique used in data analysis and machine learning to group similar data points together based on their distance or dissimilarity. The algorithm aims to minimize the variance within clusters as it merges data points into clusters, leading to a hierarchical structure of clustered data.

For a much deeper dive into clustering and machine learning, we recommend exploring the following courses:

A Practical Guide to Machine Learning with Python

This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.

72hrs 30mins
Beginner
108 Playgrounds
12 Quizzes

Mastering Machine Learning Theory and Practice

The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.

36hrs
Beginner
109 Playgrounds
10 Quizzes

Written By:
Kamran Lodhi
