
What is clustering—An introduction



Mar 24, 2025

content

The k-means clustering algorithm

How cluster assignment works in k-means: An example

Python code for k-means clustering algorithm

Density-based clustering algorithm

How density-based clustering works in DBSCAN: An example

Python code for the DBSCAN clustering algorithm

Next steps


Key takeaways:

Clustering is an unsupervised learning technique that groups similar data points without needing predefined labels.
$k$-means requires the number of clusters $k$ to be predefined and works by assigning data points to the nearest cluster based on distance.
Unlike $k$-means, DBSCAN doesn’t need the number of clusters upfront and is effective for noisy or irregularly shaped data.
Success in clustering often depends on fine-tuning parameters.
Each algorithm has its limitations. $k$-means struggles with noise, while DBSCAN may require careful parameter selection.

Machine learning has become versatile, with algorithms tailored to specific tasks and data characteristics. Clustering algorithms are critical when working with datasets that lack predefined labels. These algorithms group data points based on inherent patterns or similarities, offering valuable insights for tasks like pattern recognition, customer profiling, and detecting outliers in datasets.

To illustrate, imagine a bowl filled with balls of varying sizes and colors (with no additional context). Depending on the patterns it identifies, a clustering algorithm might group the balls by size, by color, or by a combination of the two. This process highlights how clustering reveals structure in unlabeled data, making it a powerful tool for exploratory data analysis.

Clustering is an unsupervised machine learning strategy for grouping data points into several groups or clusters. By arranging the data into a reasonable number of clusters, this approach helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge.


There are many clustering algorithms available, each designed for specific use cases. In this blog, we’ll explore two of the most popular ones.

The k-means clustering algorithm

The $k$-means clustering algorithm is one of the most commonly used clustering techniques. It partitions data into $k$ clusters, where $k$ is a user-defined input. The algorithm iteratively performs the following steps to achieve this:

1. Choose $k$ arbitrary centroids to represent the $k$ clusters. (One simple way to choose the initial centroids is to designate the first $k$ data points as the $k$ centroids.)
2. Compare each data point to all $k$ centroids and assign it to the cluster with the closest centroid, using a distance function such as Euclidean distance.
3. Recompute the centroids based on the new assignment. The mean of the data points in each cluster serves as its centroid.
4. Repeat steps 2 and 3 until the cluster assignment (or the cluster means) no longer changes or the maximum number of iterations is reached.
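The steps above can be sketched in plain Python. This is only an illustrative sketch, not a production implementation: the function name `k_means`, the `max_iters` default, and the choice to keep the old centroid when a cluster becomes empty are assumptions made here, and the eight sample points are borrowed from the DBSCAN walk-through later in this blog.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def k_means(points, k, max_iters=100):
    """Plain k-means following the four steps above; returns (labels, centroids)."""
    # Step 1: use the first k data points as the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each centroid as the mean of its assigned points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Assumed sample data: the eight points used in the DBSCAN example.
points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80), (23, 82), (100, 100)]
labels, centroids = k_means(points, k=2)
print(labels)  # → [0, 0, 0, 0, 0, 1, 1, 1]
print(centroids)
```

With $k = 2$, the outlier (100, 100) is still forced into a cluster and pulls that cluster's centroid toward itself, which illustrates why $k$-means is sensitive to noise.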


Code explanation:

Below is a line-by-line explanation of the code:

Line 1: The KMeans class is imported from the sklearn.cluster package.
Line 2: The numpy library is imported to initialize a dataset for the program.
Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
Line 5: X is initialized as a numpy array. It contains eight data items with two features each.
Line 6: The KMeans constructor is configured for $k = 2$ and trained on X. The output is stored in the object clustering.
Line 7: Cluster assignment of each data point is extracted from clustering and stored in labels.
Line 8: A vector of colors is initialized and stored in colors.
Line 9: An image size for the output plot is declared.
Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.
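For readers who want to try the clustering part of this workflow, here is a minimal sketch using scikit-learn's KMeans. It is not the listing the line numbers above refer to: the plotting step is left out, the dataset is an assumption (the eight points from the DBSCAN example later in this blog), and `n_init=10` and `random_state=0` are added here only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Assumed dataset: eight data items with two features each.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])

# Configure k-means for k = 2 and train it on X.
clustering = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Cluster assignment of each data point (0 or 1; the numbering is arbitrary).
labels = clustering.labels_
print(labels)
print(clustering.cluster_centers_)
```

The cluster ids themselves carry no meaning, so two runs may swap 0 and 1 while describing the same grouping.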

Density-based clustering algorithm

When it’s impossible to determine the number of clusters ($k$) beforehand, the $k$-means clustering algorithm may not be suitable for clustering data. Another limitation of $k$-means is its inability to differentiate noisy data points or outliers from others.

In contrast, density-based clustering does not require $k$ as an input parameter. Instead, it groups data points based on their proximity or density. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a commonly used density-based clustering algorithm.

DBSCAN requires two key parameters:

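eps: The radius that defines the neighborhood of a data point.
min_samples: The minimum number of points that must lie within a point’s eps neighborhood (counting the point itself, as in scikit-learn) for that neighborhood to be dense enough to form a cluster.

Data points that have too few neighbors within eps to form a cluster of at least min_samples points, and that are not within eps of any point that does, are treated as noisy data points or outliers.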

How density-based clustering works in DBSCAN: An example

Let’s walk through DBSCAN step by step on a dataset of eight two-dimensional points. In this case, DBSCAN forms clusters based on Euclidean distance with a predefined threshold eps = 3 and min_samples = 2.

First, DBSCAN identifies points that are close enough to form a cluster. The points (1, 2), (2, 2), and (2, 3) create the first cluster since every pair of them is within the eps distance.
Next, (8, 7) and (8, 8) are evaluated. The distance between them is 1, which is within eps, so they form a second cluster. However, they remain separate from the first cluster because they are more than eps away from all of its points.
Similarly, (25, 80) and (23, 82) form a third cluster, as they are about 2.83 apart (within eps) but far from the other clusters.
Finally, (100, 100) is analyzed. As it is not within eps of any other point, it remains isolated. Given that min_samples = 2, it does not meet the clustering criteria and is classified as an outlier.
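The walk-through above can be reproduced with a small from-scratch sketch in plain Python. It is a simplified illustration rather than a reference implementation: the function name `dbscan`, the `NOISE` constant, and the brute-force neighbor search are choices made here, and a point counts toward its own neighborhood (the same convention scikit-learn uses).

```python
from math import dist  # Euclidean distance (Python 3.8+)

NOISE = -1  # label used here for outliers


def dbscan(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    # Neighborhood of each point: indices within eps (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps] for p in points
    ]
    # A core point has at least min_samples points in its neighborhood.
    is_core = [len(n) >= min_samples for n in neighbors]
    labels = [NOISE] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != NOISE or not is_core[i]:
            continue  # already clustered, or not dense enough to seed a cluster
        # Grow a new cluster outward from core point i.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            current = stack.pop()
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    if is_core[j]:  # only core points keep expanding the cluster
                        stack.append(j)
        cluster_id += 1
    return labels


points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8), (25, 80), (23, 82), (100, 100)]
labels = dbscan(points, eps=3, min_samples=2)
print(labels)  # → [0, 0, 0, 1, 1, 2, 2, -1]
```

The three clusters and the single outlier match the four steps described above.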

Python code for the DBSCAN clustering algorithm

Here is the DBSCAN algorithm implemented in the same example:
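A minimal sketch of that example with scikit-learn's DBSCAN follows. It covers only the clustering part: the plotting step is omitted, so its line numbers will not line up with the explanation that follows, and the variable names are simply chosen to match that explanation.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# The eight two-feature data items from the walk-through.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])

# Configure DBSCAN for eps=3 and min_samples=2 and train it on X.
clustering = DBSCAN(eps=3, min_samples=2).fit(X)

# Cluster assignment of each data point; noise points are labeled -1.
labels = clustering.labels_
print(labels)
```

Here the last point, (100, 100), receives the label -1, which is how scikit-learn marks noise.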

Code explanation:

Let’s go through the code line by line:

Line 1: The DBSCAN class is imported from the sklearn.cluster package.
Line 2: The numpy library is imported to initialize a dataset for the program.
Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
Line 5: X is initialized as a numpy array. It contains eight data items with two features each.
Line 6: The DBSCAN constructor is configured for eps=3 and min_samples=2 and trained on X. The output is stored in the object clustering.
Line 7: Cluster assignment of each data point is extracted from clustering and stored in labels.
Line 8: A vector of colors is initialized and stored in colors.
Line 9: An image size for the output plot is declared.
Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.

Feel free to play with the code of both algorithms (particularly the parameters each algorithm expects) and observe their impact on the output.

Next steps

By now, you should have a solid grasp of clustering basics. If you’re ready to get hands-on with clustering, check out the Chemical Distillation Using Self-Organizing Maps project, where you’ll group ceramic samples based on their chemical composition and uncover meaningful patterns in the data. It’s a great way to sharpen your clustering skills and take a step closer to becoming a machine learning expert.


Frequently Asked Questions

What is an example of clustering?

Clustering can be illustrated using book genres. A clustering algorithm might determine that action and adventure genres are more similar than action and romance. As a result, action and adventure would be grouped into the same cluster.

How is clustering different from classification?

Classification is a supervised learning task in which we already have labeled data, and the goal is to predict those labels for new data points.
Conversely, clustering is an unsupervised task with no labels, and the goal is to group similar data points based on their features.

What are the common challenges in clustering?

Some common challenges include:

Deciding the optimal number of clusters.
Handling noisy or overlapping data.
Choosing the right algorithm for the dataset.
Evaluating the quality of clusters, as clustering is unsupervised and lacks predefined labels for direct comparison.

How do I evaluate the quality of clusters?

You can evaluate clustering using metrics like:

[Silhouette score](https://www.educative.io/answers/what-is-silhouette-score): Measures how well-separated and cohesive clusters are.
Davies-Bouldin index: Lower values indicate better clustering.
Visual inspection: Plotting the clusters if the data is 2D or 3D.
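As a rough illustration of the silhouette score, the sketch below computes it by hand in plain Python. For each point, $a$ is the mean distance to the other points in its own cluster, $b$ is the mean distance to the points of the nearest other cluster, and the point's score is $(b - a) / \max(a, b)$. The function name `mean_silhouette` and the sample labels are assumptions made for this example, and the function expects at least two clusters.

```python
from math import dist  # Euclidean distance (Python 3.8+)


def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (needs at least two clusters)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other points in the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the points of the nearest other cluster
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(1, 2), (2, 2), (2, 3), (8, 7), (8, 8)]
good = mean_silhouette(points, [0, 0, 0, 1, 1])  # compact, well-separated groups
bad = mean_silhouette(points, [0, 1, 0, 1, 0])   # groups that mix near and far points
print(good, bad)
```

Scores close to 1 indicate cohesive, well-separated clusters, so the first labeling scores much higher than the second. In practice, a library routine such as scikit-learn's `silhouette_score` would normally be used instead.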

Can clustering handle noisy data?

Some clustering algorithms, like DBSCAN, can handle noise by labeling outliers as noise instead of forcing them into clusters. However, $k$-means assigns every point to a cluster and is therefore less effective for noisy data.

Why is parameter tuning important in clustering?

Parameters like $k$ in $k$-means or eps and min_samples in DBSCAN directly affect the quality of clusters. Proper tuning ensures the clusters reflect meaningful patterns in the data. You can tune parameters by experimenting and using evaluation metrics to compare results.
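One simple way to experiment, sketched below, is to try several values of $k$ and compare their silhouette scores with scikit-learn. The dataset (the eight points from the DBSCAN example), the range of $k$ values, and the `n_init` and `random_state` settings are assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Assumed dataset: the eight points from the DBSCAN example.
X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])

scores = {}
for k in range(2, 5):  # candidate values of k to compare
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher is better

best_k = max(scores, key=scores.get)
print(scores)
print("best k by silhouette score:", best_k)
```

A metric like this is a guide rather than a verdict, so it helps to inspect the resulting clusters as well.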