Ready to apply k-means clustering in Python? Get started today with this hands-on customer segmentation project!
Key takeaways:
Clustering is an unsupervised learning technique that groups similar data points without needing predefined labels.
Unlike k-means, which needs the number of clusters upfront, DBSCAN discovers clusters from the density of the data and flags isolated points as outliers.
Success in clustering often depends on fine-tuning parameters, such as the number of clusters in k-means or eps and min_samples in DBSCAN.
Each algorithm has its limitations, so the best choice depends on the shape, density, and scale of your data.
Machine learning has become a remarkably versatile field, with algorithms tailored to specific tasks and data characteristics. Clustering algorithms are critical when working with datasets that lack predefined labels. These algorithms group data points based on inherent patterns or similarities, offering valuable insights for tasks like pattern recognition, customer profiling, and detecting outliers in datasets.
To illustrate, imagine a bowl filled with balls of varying sizes and colors (with no additional context). Depending on the identified patterns, a clustering algorithm might group the balls by size, by color, or by a combination of both. This process highlights how clustering reveals structure in unlabeled data, making it a powerful tool for exploratory data analysis.
Clustering is an unsupervised machine learning strategy for grouping data points into several groups or clusters. By arranging the data into a reasonable number of clusters, this approach helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge.
There are many clustering algorithms available, each designed for specific use cases. In this blog, we’ll explore two of the most popular ones.
The k-means clustering algorithm works in the following steps (a from-scratch sketch of the loop follows the list):
1. Choose k data points as the initial centroids, one per cluster.
2. Compare each data point to all k centroids and assign it to the cluster with the nearest centroid.
3. Recompute the centroids based on the new assignment. The mean of the data points in each cluster serves as the centroid.
4. Keep repeating steps 2 and 3 until the cluster assignment (or the cluster means) does not change or the maximum number of iterations is reached.
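To make these steps concrete, here is a minimal from-scratch sketch of the k-means loop in NumPy. It is an illustrative toy implementation (it assumes no cluster ever ends up empty, which holds for small, well-separated data), not the production-grade scikit-learn version used later:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
labels, centroids = kmeans(X, k=2)
print(labels, centroids)
```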
To assign a data point to a cluster, its distance from each centroid is computed, and the cluster with the closest centroid is chosen. One of the most common distance functions used is the Euclidean distance:

$$d(x, c) = \sqrt{\sum_{i=1}^{n} (x_i - c_i)^2}$$

where $x$ is the data point, $c$ is a centroid, and $n$ is the number of features (dimensions).
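For instance, the Euclidean distance between the points (1, 2) and (2, 3) from the dataset used below works out to about 1.41:

```python
import numpy as np

x = np.array([1, 2])  # data point
c = np.array([2, 3])  # centroid
distance = np.sqrt(np.sum((x - c) ** 2))  # equivalent to np.linalg.norm(x - c)
print(distance)  # 1.4142...
```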
The scikit-learn library provides a ready-to-use implementation of this algorithm through its `KMeans` class. Here's the Python code of the k-means algorithm applied to a small two-dimensional dataset:
```python
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = KMeans(n_clusters=2).fit(X)
labels = clustering.labels_
colors = ("red", "green", "blue", "pink", "magenta", "black", "yellow")
plt.figure(figsize=(10, 10))
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], c=colors[labels[i]], marker='x', s=100)
plt.show()
```
Code explanation:
Below is a line-by-line explanation of the code:
- Line 1: The `KMeans` class is imported from the `sklearn.cluster` package.
- Line 2: The `numpy` library is imported to initialize a dataset for the program.
- Line 3: The `matplotlib.pyplot` library is imported to visualize the outcomes.
- Line 5: `X` is initialized as a `numpy` array. It contains eight data items with two features each.
- Line 6: The `KMeans` constructor is configured with `n_clusters=2` and fitted on `X`. The output is stored in the object `clustering`.
- Line 7: The cluster assignment of each data point is extracted from `clustering` and stored in `labels`.
- Line 8: A vector of colors is initialized and stored in `colors`.
- Line 9: The figure size for the output plot is declared.
- Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.
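As a follow-up, you can inspect what the model learned and assign new points to the existing clusters. The sketch below assumes the code above has already been run; `cluster_centers_` and `predict` are standard parts of scikit-learn's `KMeans`:

```python
# Coordinates of the two learned centroids (one row per cluster).
print(clustering.cluster_centers_)

# Assign a previously unseen point to the cluster with the nearest centroid.
print(clustering.predict([[3, 3]]))
```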
When it's impossible to determine the number of clusters in advance (a common situation with real-world data), k-means is difficult to apply. In contrast, density-based clustering does not require the number of clusters to be specified upfront; it instead grows clusters from regions where points are densely packed. The most popular algorithm in this family is DBSCAN (Density-Based Spatial Clustering of Applications with Noise).
DBSCAN requires two key parameters:
- `eps`: The radius that defines the neighborhood of a data point.
- `min_samples`: The minimum number of points required to form a cluster.

Data points that fall outside the `eps` neighborhood of every cluster and do not form a cluster of at least `min_samples` points are treated as noise or outliers.
Here is a walk-through of the DBSCAN algorithm step by step. In this example, DBSCAN forms clusters based on Euclidean distance with a predefined threshold `eps = 3`.
First, DBSCAN identifies points that are close enough to form a cluster. The points (1, 2), (2, 2), and (2, 3) create the first cluster since each pair is within the `eps` distance.

Next, (8, 7) and (8, 8) are evaluated. The distance between them is within `eps`, so they form a second cluster. However, they remain separate from the first cluster because they are too far from its points.

Similarly, (25, 80) and (23, 82) form a third cluster, as they are close to each other but too distant from the other clusters.

Finally, (100, 100) is analyzed. As it is not within `eps` of any cluster, it remains isolated. Given that `min_samples = 2`, it does not meet the clustering criteria and is classified as an outlier.
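You can check these distances yourself with a few lines of NumPy; the threshold comparisons below mirror the reasoning above:

```python
import numpy as np

def dist(a, b):
    return np.linalg.norm(np.array(a) - np.array(b))

print(dist([1, 2], [2, 3]))      # ~1.41  -> within eps=3, same cluster
print(dist([2, 3], [8, 7]))      # ~7.21  -> beyond eps=3, separate clusters
print(dist([8, 8], [100, 100]))  # ~130.1 -> (100, 100) stays isolated: an outlier
```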
Here is the DBSCAN algorithm applied to the same example:
```python
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
colors = ("red", "green", "blue", "pink")
plt.figure(figsize=(10, 10))
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], c=colors[labels[i]], marker='x', s=100)
plt.show()
```
Code explanation:
Let’s go through the code line by line:
- Line 1: We've imported the `DBSCAN` class from the `sklearn.cluster` package.
- Line 2: We've imported the `numpy` library to initialize a dataset for the program.
- Line 3: The `matplotlib.pyplot` library is imported to visualize the outcomes.
- Line 5: `X` has been initialized as a `numpy` array containing eight data items with two features each.
- Line 6: The `DBSCAN` constructor is configured with `eps=3` and `min_samples=2` and trained on `X`. The output is stored in the object `clustering`.
- Line 7: The cluster assignment of each data point is extracted from `clustering` and stored in `labels`.
- Line 8: A vector of colors is initialized and stored in `colors`.
- Line 9: The figure size for the output plot is declared.
- Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster. DBSCAN labels noise points `-1`, so the outlier is drawn in the last color of the tuple thanks to Python's negative indexing.
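To make the noise handling explicit, you can count clusters and outliers directly from `labels`. This snippet assumes the DBSCAN code above has been run; the `len(set(...))` idiom simply discounts the `-1` noise label:

```python
import numpy as np

# DBSCAN marks noise points with the label -1, so discount it when counting clusters.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"Clusters found: {n_clusters}, noise points: {n_noise}")  # expect 3 clusters, 1 noise point
```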
Feel free to play with the code of both algorithms (particularly the parameters each algorithm expects) and observe their impact on the output.
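For example, here is a small sweep over `eps` (the radii are arbitrary picks for this toy dataset) that shows how clusters merge and the outlier is eventually absorbed as the neighborhood grows:

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])

# Try several neighborhood radii and observe how the cluster assignments change.
for eps in (1.5, 3, 10, 80):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    print(f"eps={eps}: labels={labels.tolist()}, clusters={n_clusters}")
```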
By now, you should have a solid grasp of clustering basics. If you’re ready to get hands-on with clustering, check out the Chemical Distillation Using Self-Organizing Maps project, where you’ll group ceramic samples based on their chemical composition and uncover meaningful patterns in the data. It’s a great way to sharpen your clustering skills and take a step closer to becoming a machine learning expert.
To explore machine learning and clustering in greater depth, consider diving into the following courses:
What is an example of clustering?
How is clustering different from classification?
What are the common challenges in clustering?
How do I evaluate the quality of clusters?
Can clustering handle noisy data?
Why is parameter tuning important in clustering?