Ready to apply k-means clustering in Python? Get started today with this hands-on customer segmentation project!
What is clustering? An introduction#
Key takeaways:

- Clustering is an unsupervised learning technique that groups similar data points without needing predefined labels.
- k-means requires the number of clusters to be predefined and works by assigning data points to the nearest cluster based on distance.
- Unlike k-means, DBSCAN doesn’t need the number of clusters upfront and is effective for noisy or irregularly shaped data.
- Success in clustering often depends on fine-tuning parameters. Each algorithm has its limitations: k-means struggles with noise, while DBSCAN may require careful parameter selection.
Machine learning has become versatile, with algorithms tailored to specific tasks and data characteristics. Clustering algorithms are critical when working with datasets that lack predefined labels. These algorithms group data points based on inherent patterns or similarities, offering valuable insights for tasks like pattern recognition, customer profiling, and detecting outliers in datasets.
To illustrate, imagine a bowl filled with balls of varying sizes and colors (with no additional context). Depending on the identified patterns, a clustering algorithm might group the balls by size, color, or combination. This process highlights how clustering reveals structure in unlabeled data, making it a powerful tool for exploratory data analysis.
Clustering is an unsupervised machine learning strategy for grouping data points into several groups or clusters. By arranging the data into a reasonable number of clusters, this approach helps to extract underlying patterns in the data and transform the raw data into meaningful knowledge.
There are many clustering algorithms available, each designed for specific use cases. In this blog, we’ll explore two of the most popular ones.
The k-means clustering algorithm#
The k-means algorithm works as follows:

1. Choose k arbitrary centroids representing the clusters. (One common way to choose the initial centroids is to designate the first k data points as centroids.)
2. Compare each data point to all k centroids and assign it to the closest cluster, identified using a distance function to compute the distance between points.
3. Recompute the centroids based on the new assignment. The mean of the data points in each cluster serves as the centroid.
4. Keep repeating steps 2 and 3 until the cluster assignments (or cluster means) do not change or the maximum number of iterations is reached.
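The four steps above can be sketched in plain numpy. This is a minimal illustration of the procedure, not the scikit-learn implementation used later in this post; the small dataset and the choice of k = 2 are arbitrary assumptions for the demo.

```python
import numpy as np

def kmeans(X, k, max_iters=100):
    # Step 1: designate the first k data points as the initial centroids
    centroids = X[:k].astype(float)
    for _ in range(max_iters):
        # Step 2: assign each point to the cluster with the nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the three tight points end up together, the (8, y) pair together
```

Note that this sketch assumes no cluster ever becomes empty; production implementations (including scikit-learn's) handle that case and use smarter initialization such as k-means++.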
How cluster assignment works in k-means: An example#
To assign a data point to its closest cluster, its distance from each centroid is computed, and the cluster with the nearest centroid is chosen. One of the most common distance functions used is Euclidean distance:

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)

Where p and q are two data points, each with n features. The data point joins the cluster whose centroid yields the smallest distance.
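As a quick worked example, the nearest-centroid check can be computed directly with numpy; the point and the two centroids here are made-up values for illustration.

```python
import numpy as np

point = np.array([2, 3])
centroids = np.array([[1, 2], [8, 7]])  # two hypothetical cluster centroids

# Euclidean distance from the point to each centroid
distances = np.sqrt(((centroids - point) ** 2).sum(axis=1))
closest = distances.argmin()

print(distances)  # -> [1.414..., 7.211...]
print(closest)    # -> 0, so the point joins the first cluster
```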
Python code for k-means clustering algorithm#
Here’s the Python code of the k-means clustering algorithm:
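A minimal listing consistent with the line-by-line explanation that follows might look like this; the eight data points are assumed to be the same ones used in the DBSCAN example later in the post.

```python
from sklearn.cluster import KMeans
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = KMeans(n_clusters=2).fit(X)
labels = clustering.labels_
colors = ["red", "green", "blue", "orange"]
plt.figure(figsize=(6, 4))
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[labels[i]])
plt.show()
```

With this layout, the blank line after the imports is line 4, so the line numbers in the explanation below line up.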
Code explanation:
Below is a line-by-line explanation of the code:
Line 1: The KMeans class is imported from the sklearn.cluster package.
Line 2: The numpy library is imported to initialize a dataset for the program.
Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
Line 5: X is initialized as a numpy array. It contains eight data items with two features each.
Line 6: The KMeans constructor is configured for k=2 and trained on X. The output is stored in the object clustering.
Line 7: The cluster assignment of each data point is extracted from clustering and stored in labels.
Line 8: A vector of colors is initialized and stored in colors.
Line 9: An image size for the output plot is declared.
Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.
Density-based clustering algorithm#
When it’s impossible to determine the number of clusters (k) in advance, k-means cannot be applied. In contrast, density-based clustering does not require the number of clusters upfront; instead, it groups points that lie in dense regions. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a widely used algorithm of this kind.
DBSCAN requires two key parameters:

- eps: The radius that defines the neighborhood of a data point.
- min_samples: The minimum number of points required to form a cluster.

Data points that fall outside the eps neighborhood of every cluster and do not form a cluster of at least min_samples points on their own are treated as noise or outliers.
How density-based clustering works in DBSCAN: An example#
Here is a walk-through of the DBSCAN algorithm step by step. In this example, DBSCAN forms clusters based on Euclidean distance with a predefined threshold eps = 3 and min_samples = 2.
- First, DBSCAN identifies points that are close enough to form a cluster. The points (1, 2), (2, 2), and (2, 3) create the first cluster since each pair is within the eps distance.
- Next, (8, 7) and (8, 8) are evaluated. The distance between these points is within eps, so they form a second cluster. However, they remain separate from the first cluster because they are too far from its points.
- Similarly, (25, 80) and (23, 82) form a third cluster, as they are close to each other but too distant from the other clusters.
- Finally, (100, 100) is analyzed. As it is not within eps of any cluster, it remains isolated. Given that min_samples = 2, it does not meet the clustering criteria and is classified as an outlier.
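The pairwise distances behind this walkthrough are easy to verify with a quick check against the eps = 3 threshold:

```python
import numpy as np

def dist(p, q):
    # Euclidean distance between two 2D points
    return float(np.linalg.norm(np.array(p) - np.array(q)))

print(dist((1, 2), (2, 2)))       # 1.0   -> within eps, first cluster
print(dist((8, 7), (8, 8)))       # 1.0   -> within eps, second cluster
print(dist((25, 80), (23, 82)))   # ~2.83 -> within eps, third cluster
print(dist((2, 3), (8, 7)))       # ~7.21 -> beyond eps, clusters stay separate
print(dist((23, 82), (100, 100))) # ~79.1 -> (100, 100) ends up as an outlier
```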
Python code for the DBSCAN clustering algorithm#
Here is the DBSCAN algorithm implemented in the same example:
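A minimal listing consistent with the line-by-line explanation below, using the same eight points with eps=3 and min_samples=2:

```python
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])
clustering = DBSCAN(eps=3, min_samples=2).fit(X)
labels = clustering.labels_
colors = ["red", "green", "blue", "orange"]
plt.figure(figsize=(6, 4))
for i in range(len(X)):
    plt.scatter(X[i][0], X[i][1], color=colors[labels[i]])
plt.show()
```

Note that DBSCAN labels noise points -1; indexing colors with -1 simply reuses the last color in the list, so in practice you may want a dedicated color for noise.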
Code explanation:
Let’s go through the code line by line:
Line 1: We’ve imported the DBSCAN class from the sklearn.cluster package.
Line 2: We’ve imported the numpy library to initialize a dataset for the program.
Line 3: The matplotlib.pyplot library is imported to visualize the outcomes.
Line 5: X has been initialized as a numpy array containing eight data items with two features each.
Line 6: The DBSCAN constructor is configured for eps=3 and min_samples=2 and trained on X. The output is stored in the object clustering.
Line 7: The cluster assignment of each data point is extracted from clustering and stored in labels.
Line 8: A vector of colors is initialized and stored in colors.
Line 9: An image size for the output plot is declared.
Lines 10–12: Each data item is plotted in a scatter plot with a color corresponding to its cluster.
Feel free to play with the code of both algorithms (particularly the parameters each algorithm expects) and observe their impact on the output.
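As a starting point for that experimentation, here is one way to see how eps alone changes the outcome on the same dataset; the three eps values are arbitrary choices for illustration.

```python
from sklearn.cluster import DBSCAN
import numpy as np

X = np.array([[1, 2], [2, 2], [2, 3], [8, 7], [8, 8], [25, 80], [23, 82], [100, 100]])

for eps in (3, 10, 80):
    labels = DBSCAN(eps=eps, min_samples=2).fit(X).labels_
    n_clusters = len(set(labels) - {-1})   # -1 marks noise, not a cluster
    n_noise = list(labels).count(-1)
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```

With these values, a small radius yields three clusters plus one outlier, a larger radius merges the two nearby clusters, and a very large radius swallows every point into a single cluster with no noise.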
Next steps#
By now, you should have a solid grasp of clustering basics. If you’re ready to get hands-on with clustering, check out the Chemical Distillation Using Self-Organizing Maps project, where you’ll group ceramic samples based on their chemical composition and uncover meaningful patterns in the data. It’s a great way to sharpen your clustering skills and take a step closer to becoming a machine learning expert.
To explore machine learning and clustering in greater depth, consider diving into the following courses:
Frequently Asked Questions
What is an example of clustering?
How is clustering different from classification?
What are the common challenges in clustering?
How do I evaluate the quality of clusters?
Can clustering handle noisy data?
Why is parameter tuning important in clustering?