Introduction to Clustering
Explore clustering algorithms and their limitations with sample datasets and similarity functions.
In life, we group many things to understand them better or to simplify them. Let’s say you and your friend are trying to classify video games. Your friend might classify them by genre and end up with all the video games in neat little clusters. You, classifying the same games by developer, might end up with a different collection of clusters.
> Note: Both collections originated from the same data pool (video games), and each grouping reveals something interesting about the data.
In machine learning, we do something similar. We ask the machine to group the data to get meaningful information. This grouping of unlabeled data is called clustering, and it is based on the similarity among data points. After clustering, the data points should be similar within clusters and dissimilar across clusters.
We can see a lot of data points below:
If we treat the distance between points as dissimilarity, we can easily see that there are three different clusters. A machine, however, needs a robust algorithmic approach to find these clusters.
In the illustration above, we have a lot of data, but it is two-dimensional and neatly spread out, with clear boundaries between the groups. In the real world, data is rarely this clean, and beyond three features, showing it graphically becomes nearly impossible.
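To make the idea of "distance as dissimilarity" concrete, here is a minimal sketch. It is not one of the clustering algorithms this lesson covers, just a naive threshold-based grouping: any two points closer than a chosen threshold end up with the same label. The toy `points`, the function name `group_by_distance`, and the `threshold` value are all assumptions made for illustration, and straight-line distance (Python's `math.dist`) stands in for the dissimilarity.

```python
import math

# Hypothetical toy data: three well-separated groups of 2D points.
points = [
    (1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
    (5.0, 5.0), (5.1, 4.9), (4.8, 5.2),
    (9.0, 1.0), (9.2, 1.1), (8.9, 0.8),
]

def group_by_distance(points, threshold):
    """Give the same label to points connected by hops shorter than `threshold`."""
    labels = [None] * len(points)
    next_label = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue  # already assigned to an earlier group
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(len(points)):
                # Straight-line distance plays the role of dissimilarity here.
                if labels[j] is None and math.dist(points[i], points[j]) < threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels

labels = group_by_distance(points, threshold=1.0)
print(labels)  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

On clean, well-separated data like this, a single threshold is enough to recover three groups. On messy real-world data, picking such a threshold is much harder, which is one reason more careful algorithms are needed.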
> Note: Clustering algorithms rely heavily on the similarity/dissimilarity functions they use.
What are similarity and dissimilarity functions?
A similarity function is a numerical measure of how alike two objects are. Given any pair of data points, it tells us how similar they are to each other. A dissimilarity function, on the other hand, is a numerical measure of how different two data objects are.
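As a rough illustration of the two kinds of functions, the sketch below defines one possible pair. Both are assumptions chosen for simplicity, not definitions from this lesson: `dissimilarity` sums the absolute differences of the features, and `similarity` converts that value into a score between 0 and 1, where 1 means the two points are identical.

```python
def dissimilarity(a, b):
    """Hypothetical dissimilarity: sum of absolute feature differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def similarity(a, b):
    """Hypothetical similarity in (0, 1]: shrinks as dissimilarity grows."""
    return 1.0 / (1.0 + dissimilarity(a, b))

# Illustrative data points with two features each.
p, q, r = (1.0, 2.0), (1.0, 2.5), (6.0, 9.0)

print(dissimilarity(p, q), similarity(p, q))  # small difference, high similarity
print(dissimilarity(p, r), similarity(p, r))  # large difference, low similarity
```

Here `p` and `q` are close, so their dissimilarity is small and their similarity is high, while `p` and `r` are far apart and score the opposite way. Many other choices of function are possible, and the right one depends on the data.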
Euclidean distance
The Euclidean distance between two points ...