
What is clustering—An introduction

6 min read
Mar 24, 2025
Contents
The k-means clustering algorithm
How cluster assignment works in k-means: An example
Python code for k-means clustering algorithm
Density-based clustering algorithm
How density-based clustering works in DBSCAN: An example
Python code for the DBSCAN clustering algorithm
Next steps


Key takeaways:

  • Clustering is an unsupervised learning technique that groups similar data points without needing predefined labels.

  • k-means requires the number of clusters k to be predefined and works by assigning data points to the nearest cluster based on distance.

  • Unlike k-means, DBSCAN doesn’t need the number of clusters upfront and is effective for noisy or irregularly shaped data.

  • Success in clustering often depends on fine-tuning parameters.

  • Each algorithm has its limitations: k-means struggles with noise, while DBSCAN may require careful parameter selection.

Machine learning has become versatile, with algorithms tailored to specific tasks and data characteristics. Clustering algorithms are critical when working with datasets that lack predefined labels. These algorithms group data points based on inherent patterns or similarities, offering valuable insights for tasks like pattern recognition, customer profiling, and detecting outliers in datasets.

To illustrate, imagine a bowl filled with balls of varying sizes and colors (with no additional context). Depending on the identified patterns, a clustering algorithm might group the balls by size, by color, or by a combination of both. This process highlights how clustering reveals structure in unlabeled data, making it a powerful tool for exploratory data analysis.

Clustering is an unsupervised machine learning strategy for grouping data points into several groups or clusters. By arranging the data into a reasonable number of clusters, this approach helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge.

Clustering of shapes data

There are many clustering algorithms available, each designed for specific use cases. In this blog, we’ll explore two of the most popular ones.

The k-means clustering algorithm#

The k-means clustering algorithm is one of the most commonly used clustering techniques. It partitions data into k clusters, where k is a user-defined input. The algorithm iteratively performs the following steps:

  1. Choose k arbitrary centroids representing k clusters. (One common way to choose the initial centroids is to designate the first k data points as the k centroids.)

  2. Compare each data point to all k centroids and assign it to the closest cluster, identified using a distance function to compute the distance between points.

  3. Recompute the centroids based on the new assignment. The mean of the data points in each cluster serves as the centroid.

  4. Keep repeating steps 2 and 3 above until the cluster assignment (or cluster means) does not change or the maximum number of iterations is reached.
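
The four steps above can be sketched in plain NumPy. This is a minimal illustration for intuition, not scikit-learn's implementation; it assumes no cluster ever becomes empty during the iterations:

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    # Step 1: designate the first k data points as the initial centroids.
    centroids = X[:k].astype(float)
    for _ in range(max_iters):
        # Step 2: assign every point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence the assignments) settle.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]], dtype=float)
labels, centroids = kmeans(X, k=2)
print(labels)  # [0 0 0 1 1]
```

Notice how the initial assignment shifts over the iterations: the centroids drift toward the two natural groups, and the loop stops when recomputing the means no longer changes them.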


How cluster assignment works in k-means: An example#

The distance of a data point from all k centroids is computed, and the point is assigned to the closest cluster. One of the most common distance functions used is the Euclidean distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Where $x_i$ and $y_i$ are the $i$th features of the $\mathbf{x}$ and $\mathbf{y}$ data instances, and $n$ is the number of features in each instance.
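
As a quick check of the formula, here is a small illustrative helper that computes the Euclidean distance between two points:

```python
import math

def euclidean(x, y):
    # Square root of the sum of squared feature-wise differences.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean((1, 2), (2, 2)))  # 1.0
print(euclidean((1, 2), (8, 7)))  # sqrt(49 + 25), roughly 8.602
```

In the cluster-assignment step, this distance is computed from a data point to each of the k centroids, and the point joins the cluster with the smallest value.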

Below, the k-means algorithm is applied step by step to a small dataset to form two clusters using the Euclidean distance:

Initial data

Python code for k-means clustering algorithm#

Here’s the Python code for the k-means algorithm applied to the same example:

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = KMeans(n_clusters=2).fit(X)
labels = clustering.labels_
colors = ("red", "green", "blue", "pink", "magenta", "black", "yellow")
plt.figure(figsize=(10, 10))
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], c=colors[labels[i]], marker='x', s=100)
plt.show()

Code explanation:

Below is a line-by-line explanation of the code:

  • Line 1: The KMeans class is imported from the sklearn.cluster package.

  • Line 2: The numpy library is imported to initialize a dataset for the program.

  • Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.

  • Line 5: X is initialized as a numpy array. It contains eight data items with two features each.

  • Line 6: The KMeans constructor is configured for k=2 and trained on X. The output is stored in the object clustering.

  • Line 7: Cluster assignment of each data point is extracted from clustering and stored in labels.

  • Line 8: A vector of colors is initialized and stored in colors.

  • Line 9: An image size for the output plot is declared.

  • Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.

Customer Segmentation with K-Means Clustering

Ready to apply k-means clustering in Python? Get started today with this hands-on customer segmentation project!

Density-based clustering algorithm#

When it’s impossible to determine the number of clusters (k) beforehand, the k-means clustering algorithm may not be suitable. Another limitation of k-means is its inability to differentiate noisy data points or outliers from the rest of the data.
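
To see this limitation concretely, here is a small sketch using the same eight points as the examples below. Because k-means assigns every point to one of the k clusters, it has no way to flag the isolated point (100, 100) as noise:

```python
from sklearn.cluster import KMeans
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])

# k-means has no notion of noise: every point, including the isolated
# outlier (100, 100), is forced into one of the k clusters.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X).labels_
print(km_labels[-1] in (0, 1))  # True: the outlier is absorbed into a cluster
```

(The n_init and random_state arguments are set here only to make the run repeatable; they are not part of the original example.)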

In contrast, density-based clustering does not require k as an input parameter. Instead, it groups data points based on their proximity or density. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a commonly used density-based clustering algorithm.

DBSCAN requires two key parameters:

  • eps: The radius that defines the neighborhood of a data point.

  • min_samples: The minimum number of points required to form a cluster.

Data points that lie outside the eps neighborhood of every cluster, or that cannot form a cluster of at least min_samples points, are treated as noisy data points or outliers.

How density-based clustering works in DBSCAN: An example#

Here is a walk-through of the DBSCAN algorithm step by step:

DBSCAN clustering algorithm

Let’s break down the example step by step. In this case, DBSCAN forms clusters based on Euclidean distance with a predefined threshold eps = 3.

  • First, DBSCAN identifies points that are close enough to form a cluster. The points (1, 2), (2, 2), and (2, 3) create the first cluster since each pair is within the eps distance.

  • Next, (8, 7) and (8, 8) are evaluated. The distance between the two points is within eps, so they form a second cluster. However, they remain separate from the first cluster because they are too far from its points.

  • Similarly, (25, 80) and (23, 82) form a third cluster, as they are close to each other but too distant from the other clusters.

  • Finally, (100, 100) is analyzed. As it is not within eps of any cluster, it remains isolated. Given that min_samples = 2, it does not meet the clustering criteria and is classified as an outlier.

Python code for the DBSCAN clustering algorithm#

Here is the DBSCAN algorithm implemented in the same example:

from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
colors = ("red", "green", "blue", "pink")
plt.figure(figsize=(10, 10))
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], c=colors[labels[i]], marker='x', s=100)
plt.show()

Code explanation:

Let’s go through the code line by line:

  • Line 1: The DBSCAN class is imported from the sklearn.cluster package.

  • Line 2: The numpy library is imported to initialize a dataset for the program.

  • Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.

  • Line 5: X is initialized as a numpy array containing eight data items with two features each.

  • Line 6: The DBSCAN constructor is configured with eps=3 and min_samples=2 and trained on X. The output is stored in the object clustering.

  • Line 7: Cluster assignment of each data point is extracted from clustering and stored in labels.

  • Line 8: A vector of colors is initialized and stored in colors.

  • Line 9: An image size for the output plot is declared.

  • Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.

Feel free to play with the code of both algorithms (particularly the parameters each algorithm expects) and observe their impact on the output.
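
For example, one illustrative experiment (not from the original post): widening eps from 3 to 10 merges the two lower-left clusters into one, while (100, 100) remains labeled as noise (-1) because no point is within 10 units of it:

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])

# Widening eps grows each point's neighborhood, so nearby clusters merge,
# while min_samples controls how easily points qualify as core points.
for eps in (3, 10):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: clusters={n_clusters}, noise points={list(labels).count(-1)}")
```

With eps=3 there are three clusters and one noise point; with eps=10 the clusters around (1.7, 2.3) and (8, 7.5) merge, leaving two clusters and still one noise point.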

Next steps#

By now, you should have a solid grasp of clustering basics. If you’re ready to get hands-on with clustering, check out the Chemical Distillation Using Self-Organizing Maps project, where you’ll group ceramic samples based on their chemical composition and uncover meaningful patterns in the data. It’s a great way to sharpen your clustering skills and take a step closer to becoming a machine learning expert.

To explore machine learning and clustering in greater depth, consider diving into a dedicated machine learning course.


Frequently Asked Questions

What is an example of clustering?

Clustering can be illustrated using book genres. A clustering algorithm might determine that action and adventure genres are more similar than action and romance. As a result, action and adventure would be grouped into the same cluster.



Written By:
Malik Jahan