Imagine waking up to a long to-do list: a jumble of tasks with no clear order. Clustering is like having an intuitive assistant that groups similar tasks in our list, making it easier to prioritize and accomplish them efficiently.
Let's consider the example of a cabinet filled with clothes. We want to organize the cabinet so that it becomes easier to find what we need, and this is where clustering can be helpful. We can think of the clothes as data points and associate features with each data point to define it, as shown below:
We can start by category; all shirts would go in one group, and all pants would go in another.
We can group clothes with similar colors so that they match well.
We can further organize clothes based on the season—for instance, warm clothes for winter and lighter clothes for summer.
Clustering can help organize our cabinet by grouping similar items. This helps us find what we need faster, coordinate our outfits, and keep the cabinet organized with less effort.
In the context of machine learning, clustering is a type of unsupervised learning that partitions the dataset into distinct groups. In contrast to classification algorithms, clustering algorithms learn from unlabeled data to uncover underlying patterns without prior information. Clustering algorithms identify patterns in the dataset based on similarity or distance between data points.
In this blog, we'll look at the main types of clustering, the common algorithms for each type, and their most common use cases.
Centroid-based clustering partitions the data into nonoverlapping clusters around centroids. A centroid represents the average of all the data points in a cluster. During clustering, a data point is assigned to the nearest centroid.
K-means clustering is a popular centroid-based clustering algorithm. K-means alternates between two steps: it assigns each data point to its nearest centroid, as measured by a distance metric (typically the squared Euclidean distance), and then recomputes each centroid as the mean of the points assigned to it. The iterations continue until the assignments stabilize, minimizing the total within-cluster distance. Other centroid-based clustering algorithms include k-medoids clustering and fuzzy c-means.
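To make this concrete, here's a minimal k-means sketch using scikit-learn. The synthetic data and the choice of k = 3 are illustrative assumptions, not values from any particular application:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic 2-D points scattered around three rough centers (illustrative data).
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k = 3 is an assumption here; in practice, k is often chosen with the
# elbow method or silhouette analysis.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)   # cluster index for each data point
print(kmeans.cluster_centers_)   # the learned centroids
```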
Centroid-based clustering is widely used for its simplicity and efficiency, particularly with large datasets. Let's look at some of the common use cases:
Marketing: We can segment customers into clusters using centroid-based clustering algorithms based on features like purchasing behavior and demographics such as age and gender.
Anomaly detection: We can detect behavior that deviates from the norm: any data point that lies far from every centroid can be considered an anomaly (see the sketch after this list).
Document clustering: We can group similar documents using centroid-based algorithms, where each centroid represents a genre.
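As a rough illustration of the anomaly detection use case, the sketch below reuses the fitted `kmeans` model and data `X` from the example above and flags points that are far from their nearest centroid. The 95th-percentile threshold is an arbitrary choice for demonstration:

```python
import numpy as np

# `kmeans` and `X` come from the k-means sketch above.
# kmeans.transform(X) returns each point's distance to every centroid.
distances = kmeans.transform(X).min(axis=1)  # distance to nearest centroid
threshold = np.percentile(distances, 95)     # arbitrary illustrative cutoff
anomalies = X[distances > threshold]
print(f"{len(anomalies)} points flagged as potential anomalies")
```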
Density-based clustering partitions data into clusters based on the density of data points. It identifies dense regions of data points separated by sparser regions, which means the number of clusters is discovered from the data rather than specified in advance.
Density-based spatial clustering of applications with noise (DBSCAN) categorizes data points into three categories (a code sketch follows this list):
Core points: These have at least a minimum number of neighboring data points within a defined radius.
Border points: These points are within the defined radius of a core point but do not have the minimum number of surrounding points.
Noise points: These data points do not fall within the neighborhood of any core point.
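Here's a minimal DBSCAN sketch using scikit-learn on two interleaving half-moons, a shape that centroid-based methods typically split incorrectly. The `eps` and `min_samples` values are illustrative and would need tuning on real data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaving half-moons: irregularly shaped clusters.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# eps is the neighborhood radius; min_samples is the minimum number of
# neighbors a core point must have. Both values here are illustrative.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_  # -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {(labels == -1).sum()}")
```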
Density-based clustering methods are useful in scenarios where clusters may have irregular shapes and sizes, as in the half-moons sketch above. Let's explore some common applications of density-based clustering:
Anomaly detection: We can identify data points that don't fall within the neighborhood of any core point; these can be considered outliers or anomalies.
Spatial data analysis: Because spatial clusters often have irregular shapes, these methods can cluster spatial data points based on their proximity and provide valuable information in location-based services.
Image segmentation: We can segment pixels in an image based on their spatial density to extract meaningful regions or objects.
Oftentimes, the data points follow a particular well-defined distribution, and that's where distribution-based clustering methods come into play. These methods assume the data is generated from several distributions and group the data points into clusters that are most likely to share a common distribution. This approach uses probability as the metric for associating each data point with a cluster. One commonly used method for distribution-based clustering is the Gaussian mixture model. As the name suggests, the Gaussian mixture model assumes that the input data is generated from a mixture of several Gaussian distributions.
The Gaussian mixture model defines each cluster by the parameters of a Gaussian component, such as its mean and variance. The model then assigns each data point to the cluster whose Gaussian component it belongs to with the highest probability.
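Below is a minimal Gaussian mixture sketch using scikit-learn. The synthetic blobs and the choice of three components are illustrative assumptions:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Synthetic blobs standing in for data drawn from three Gaussian sources.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=42)
labels = gmm.fit_predict(X)   # hard assignment: most probable component
probs = gmm.predict_proba(X)  # soft assignment: probability per component
print(gmm.means_)             # the learned component means
```

Unlike k-means, the model also exposes the soft assignments in `probs`, which is useful when a point could plausibly belong to more than one cluster.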
Let’s look at the common applications where distribution-based clustering methods are useful:
Finance and economics: We can identify trends or patterns in stock prices using the stock’s price time series.
Biology and genetics: We can cluster the genetic sequences or biological samples based on similarities in their molecular profiles.
Image processing: We can denoise images corrupted with Gaussian noise. Common denoising applications can be found in medical imaging, astronomy, and computer vision.
Hierarchical clustering builds a hierarchy of clusters. It successively merges or splits existing clusters to create a tree-like structure where the data points are grouped at different levels of granularity.
There are two main types of hierarchical clustering:
Agglomerative hierarchical clustering: Here, we start with as many clusters as data points, treating each point as its own cluster. The algorithm then iteratively merges the closest clusters until only one cluster remains (see the sketch after this list).
Divisive hierarchical clustering: Alternatively, we can start with the assumption that all data points belong to a single cluster and then recursively divide them into smaller clusters until each data point is its own cluster.
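Here's a minimal agglomerative clustering sketch using scikit-learn. The synthetic data and the choice of three clusters are illustrative assumptions:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# Ward linkage merges, at each step, the pair of clusters whose union
# yields the smallest increase in within-cluster variance.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
labels = agg.fit_predict(X)
print(labels[:10])
```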
Let’s look at some of the applications of hierarchical clustering:
Biology: Hierarchical clustering helps classify species based on genetic or morphological traits and construct phylogenetic trees.
Social network analysis: We can group users based on similarities in their interests.
This blog provides a brief overview of the different types of clustering algorithms in machine learning. Knowing how they differ helps us make an informed decision about which clustering approach is best suited for our specific data, leading to an effective and meaningful analysis. If you want to learn more about how these clustering algorithms are implemented, we encourage you to check out the following courses at Educative.
Simplifying Machine Learning with PyCaret in Python
PyCaret is a low-code, open-source machine learning library for Python. It can be used for machine learning tasks such as data preparation and model deployment. In this course, you will learn multiple topics related to machine learning. You will start with a brief introduction to the basic concepts of machine learning and then continue with case studies of regression, classification, clustering, and anomaly detection based on the respective modules of the PyCaret library. Finally, we will focus on using the Streamlit library to develop and deploy machine learning applications. By the end of this course, you will have the skills to deploy the robust PyCaret library for any of your machine learning projects.
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.
An Introductory Guide to Data Science and Machine Learning
There is a lot of dispersed and somewhat conflicting information on the internet when it comes to data science, making it tough to know where to start. Don't worry. This course will get you familiar with the state of data science and related fields such as machine learning and big data. You will go through the fundamental concepts and libraries that are essential to solving any problem in this field. You will work on real-world projects from Kaggle while also honing the mathematical skills that are used extensively in most problems you face. You will also be taken through a systematic approach to learning, from data acquisition to data wrangling and everything in between. This is your all-in-one guide to becoming a confident data scientist.