What is Clustering: An Introduction

Malik Jahan
5 min read

Machine learning has grown into a broad field that is applied across a wide variety of disciplines. The choice of a machine learning algorithm depends primarily on the nature of the task and the dataset. If the dataset consists of instances (data points) that have no predetermined labels, clustering algorithms are used to process the data and extract patterns from it. For example, if a bowl contains a mix of balls of different sizes and colors (with no additional information), and the task is to sort the balls into an appropriate number of groups, then this is a clustering task.

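Clustering is an unsupervised learning strategy that groups a given set of data points into a number of groups, or clusters, so that points in the same cluster are more similar to each other than to points in other clusters.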

Arranging the data into a reasonable number of clusters helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge. Example application areas include the following:

  • Pattern recognition
  • Image segmentation
  • Profiling users or customers
  • Categorization of objects into a number of categories or groups
  • Detection of outliers or noise in a pool of data items

The clustering task can be stated as follows: given a dataset, distribute the data into an appropriate number of clusters. Many clustering algorithms have been proposed in the literature. The next sections explore two popular ones.

The k-means clustering algorithm

The k-means clustering algorithm is one of the most commonly used clustering algorithms. It clusters the given data into k clusters, and it expects k to be supplied as an input. It's an iterative algorithm that performs the following steps to cluster the given data into k clusters:

  1. Choose k arbitrary centroids representing the k clusters. (One common way to choose the initial centroids is to designate the first k data points as the k centroids.)
  2. Compare each data point with all k centroids and assign it to the closest cluster. An appropriate distance function is used to compute the distance between two data items.
  3. Recompute the centroids based on the new assignment. The mean of the data items in each cluster serves as its centroid.
  4. Keep repeating steps 2 and 3 until there is no change in the cluster assignment (or in the cluster means), or until an upper limit on the number of iterations is reached.
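The four steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not scikit-learn's implementation: the function name kmeans and the max_iter default are hypothetical choices made here, the first k points serve as initial centroids, Euclidean distance is used, and an empty cluster simply keeps its old centroid.

```python
import numpy as np

def kmeans(X, k, max_iter=100):
    """Minimal k-means sketch (hypothetical helper, not a library API)."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()                 # step 1: first k points as centroids
    labels = np.full(len(X), -1)             # no assignment yet
    for _ in range(max_iter):                # step 4: iteration cap
        # step 2: distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break                            # step 4: assignment unchanged
        labels = new_labels
        # step 3: each centroid becomes the mean of its assigned points
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:             # assumption: keep old centroid if empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```

On this dataset the sketch converges after three passes: the first five points end up in one cluster and the last three in the other.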

To assign a data point to the closest cluster, its distance from every centroid is computed, and the cluster with the nearest centroid is chosen. One of the most common distance functions is the Euclidean distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

where $x_i$ and $y_i$ are the $i$th features of the data instances $\mathbf{x}$ and $\mathbf{y}$, and $n$ is the number of features in each instance.
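As a quick check of the Euclidean distance (the square root of the summed squared feature differences), here is a small sketch that computes it two ways for a pair of two-feature points. The variable names are illustrative only.

```python
import numpy as np

x = np.array([1, 2])
y = np.array([8, 7])

# Direct translation of the definition: sqrt of the sum of squared differences
d_manual = np.sqrt(np.sum((x - y) ** 2))

# Equivalent built-in: the L2 norm of the difference vector
d_norm = np.linalg.norm(x - y)

print(d_manual, d_norm)  # both equal sqrt(49 + 25) = sqrt(74), about 8.60
```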

The following slides walk through these steps on a small dataset to form two clusters:

[Slideshow: k-means clustering algorithm example, 7 slides]

Here is the k-means algorithm applied to the same example:

from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1,2],[2,2],[2,3],[8,7],[8,8],[25,80],[23,82],[100,100]])
clustering = KMeans(n_clusters=2).fit(X)
labels = clustering.labels_
colors = ("red","green","blue","pink","magenta","black","yellow")
for i in range(len(X)):
    plt.scatter(X[i][0],X[i][1], c = colors[labels[i]], marker = 'x')
plt.savefig("output/scatter.png")

Below is a line-by-line explanation of the code:

  • Line 1: The KMeans class is imported from the sklearn.cluster module.
  • Line 2: The numpy library is imported to initialize the dataset used in the program.
  • Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
  • Line 5: X is initialized as a NumPy array. It contains eight data items with two features each.
  • Line 6: The KMeans constructor is configured for k=2 and trained on X. The fitted model is stored in the object clustering.
  • Line 7: The cluster assignment of each data point is extracted from clustering and stored in labels.
  • Line 8: A tuple of colors is initialized and stored in colors.
  • Lines 9-11: Each data item is plotted in a scatter plot with the color corresponding to its cluster, and the plot is saved to a file.
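Beyond the labels, a fitted KMeans model also exposes the learned centroids and can assign new points to the nearest centroid. The sketch below shows this on the same dataset. The n_init=10 and random_state=0 arguments, the variable name model, and the two new points are choices made here for reproducibility and illustration; they are not part of the original example.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])

# n_init and random_state are assumptions added for repeatable runs
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(model.cluster_centers_)   # one row per centroid (the mean of each cluster)
print(model.inertia_)           # sum of squared distances to the closest centroid

# Assign previously unseen points to the nearest learned centroid
new_points = np.array([[0, 0], [90, 95]])
predicted = model.predict(new_points)
print(predicted)
```

The numbering of the clusters (which one is 0 and which is 1) is arbitrary, so only the grouping itself should be relied on.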

Density-based clustering algorithm

When it’s not possible to decide the number of clusters k beforehand, the k-means clustering algorithm is not a good choice for clustering the data. Another limitation of the k-means algorithm is that it doesn’t distinguish noisy data points or outliers from the other data points: every point is assigned to some cluster.
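The effect of an outlier on k-means can be seen in a small experiment: fit the same model with and without the far-away point [100, 100] and compare the centroid of the upper cluster. This is an illustrative sketch; the helper upper_centroid, the n_init=10 and random_state=0 arguments, and the choice to drop the last point are assumptions made here.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])
X_clean = X[:-1]  # same data without the far-away point [100, 100]

def upper_centroid(data):
    """Fit k-means with k=2; return the centroid with the larger y value and the labels."""
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)
    centers = model.cluster_centers_
    return centers[centers[:, 1].argmax()], model.labels_

with_outlier, labels = upper_centroid(X)
without_outlier, _ = upper_centroid(X_clean)

print(labels)            # every point, including [100, 100], gets label 0 or 1
print(with_outlier)      # centroid dragged toward the outlier
print(without_outlier)   # centroid of [25, 80] and [23, 82] alone
```

With the outlier included, the upper centroid is the mean of three points and sits far from the two points that actually lie close together.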

Density-based clustering doesn’t expect k as one of its input parameters. Instead, it clusters the given data based on the proximity (density) of the data points. One of the commonly used density-based clustering algorithms is DBSCAN (density-based spatial clustering of applications with noise). The algorithm expects two parameters: a threshold eps that defines the neighborhood of a data point, and min_samples, the minimum number of points that must lie within an eps neighborhood for a cluster to form. A point with at least min_samples points (itself included) in its eps neighborhood is called a core point. Data points that are neither core points nor within eps of a core point can’t belong to a cluster of the smallest acceptable size and are treated as noisy data points or outliers.
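To make the roles of eps and min_samples concrete, the sketch below counts how many points fall inside each point's eps neighborhood and flags the points that are dense enough to seed a cluster. It is only the neighborhood-counting part of DBSCAN, not the full algorithm, and it assumes Euclidean distance and the scikit-learn convention that a point counts as its own neighbor.

```python
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])
eps, min_samples = 3, 2

# Pairwise Euclidean distances, shape (n, n)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Number of points within eps of each point (the point itself is included)
neighbor_counts = (dists <= eps).sum(axis=1)

# A point is dense enough to seed a cluster if it has at least min_samples neighbors
is_core = neighbor_counts >= min_samples

print(neighbor_counts)  # [3 3 3 2 2 2 2 1]
print(is_core)          # only [100, 100] fails the test
```

The last point has no neighbor other than itself and is not within eps of any dense point, so DBSCAN would treat it as noise.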

Here is a step-by-step walkthrough of the DBSCAN algorithm:

[Slideshow: DBSCAN clustering algorithm example, 5 slides]

Here is the DBSCAN algorithm applied to the same example:

from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1,2],[2,2],[2,3],[8,7],[8,8],[25,80],[23,82],[100,100]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
colors = ("red","green","blue","pink")
for i in range(len(X)):
    plt.scatter(X[i][0],X[i][1], c = colors[labels[i]], marker = 'x')
plt.savefig("output/scatter.png")

Let’s go through the code line by line:

  • Line 1: The DBSCAN class is imported from the sklearn.cluster module.
  • Line 2: The numpy library is imported to initialize the dataset used in the program.
  • Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
  • Line 5: X is initialized as a NumPy array. It contains eight data items with two features each.
  • Line 6: The DBSCAN constructor is configured for eps=3 and min_samples=2 and trained on X. The fitted model is stored in the object clustering.
  • Line 7: The cluster assignment of each data point is extracted from clustering and stored in labels. Noisy data points receive the label -1.
  • Line 8: A tuple of colors is initialized and stored in colors. Because a label of -1 indexes the last element of the tuple, noisy points are drawn in pink.
  • Lines 9-11: Each data item is plotted in a scatter plot with the color corresponding to its cluster, and the plot is saved to a file.
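The labels returned by DBSCAN can also be summarized without plotting. The short sketch below counts the clusters and the noisy points for the same data and parameters; the variable names n_clusters and n_noise are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])
labels = DBSCAN(eps=3, min_samples=2).fit(X).labels_

# Cluster labels are 0, 1, 2, ...; the label -1 marks noise and is not a cluster
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print(labels)      # [ 0  0  0  1  1  2  2 -1]
print(n_clusters)  # 3
print(n_noise)     # 1
```

Unlike k-means with k=2, DBSCAN finds three clusters here and leaves [100, 100] unassigned.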

Feel free to play with the code of both algorithms (particularly the parameters each algorithm expects) and observe their impact on the output.
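As one possible starting point for such experiments, the sketch below reruns DBSCAN on the same dataset for a few values of eps and records how many clusters and noisy points result. The particular eps values and the results dictionary are arbitrary choices made here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8],
              [25, 80], [23, 82], [100, 100]])

results = {}
for eps in (1.5, 3, 10, 100):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int(np.sum(labels == -1))
    results[eps] = (n_clusters, n_noise)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noisy points")
```

A small eps leaves more points as noise, while a very large eps merges everything into a single cluster.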

By now, you probably have a good grasp of the basics of clustering and are ready to start your journey to become a machine learning expert.

Frequently Asked Questions

What is an example of clustering?

Clustering book genres is one example. A particular clustering method might determine that the ‘action’ and ‘adventure’ genres are more similar to each other than ‘action’ and ‘romance’ are. As a result, it would place ‘action’ and ‘adventure’ in the same cluster.


  
