K-medoids, also known as partitioning around medoids (PAM), is a popular clustering algorithm that groups data points into k clusters.
Whereas the k-means clustering algorithm represents each cluster by a centroid, k-medoids represents each cluster by a medoid instead.
A medoid can be defined as the point in a cluster from which the sum of distances to the other points in that cluster is minimal. In other words, it is the data point with the lowest total dissimilarity to the other members of its cluster.
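As a quick illustration, here is a tiny brute-force helper (a hypothetical function of our own, not a library call) that finds the medoid of a cluster using the Manhattan distance this article uses throughout:

```python
# Hypothetical helper: the medoid is the cluster point whose summed
# distance to all other points in the cluster is smallest.
def medoid(points):
    def manhattan(p, q):
        return abs(p[0] - q[0]) + abs(p[1] - q[1])
    return min(points, key=lambda p: sum(manhattan(p, q) for q in points))

print(medoid([(2, 5), (3, 8), (5, 8), (4, 4)]))  # prints (3, 8)
```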
The k-means algorithm is sensitive to outliers. The k-medoids algorithm mitigates that sensitivity by eliminating the reliance on centroids: because a medoid must be an actual data point, an outlier cannot drag a cluster's representative away from the data. The algorithm aims to group the data points into k clusters.
The distance between each data point and each medoid is calculated using the Manhattan distance formula, d = |x1 - x2| + |y1 - y2|. The total of these distances over all assigned points is also known as the cost.
1. Select k random data points as the initial medoids.
2. Assign each non-medoid point to the cluster of its nearest medoid.
3. Calculate the cost, i.e., the total sum of the distances (dissimilarities) of the data points from their assigned medoids.
4. Swap a non-medoid point with one of the medoids and recalculate the cost.
5. If the recalculated cost exceeds the previous cost, undo the swap; the algorithm has converged. Otherwise, keep the swap and repeat step 4.
After the algorithm converges, we will have k clusters, each represented by its medoid.
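Here is a minimal pure-Python sketch of these five steps, assuming Manhattan distance and a greedy "keep any swap that lowers the cost" loop; the names k_medoids and total_cost are our own illustration, not a library API:

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # Step 3: each point contributes the distance to its nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def k_medoids(points, k, seed=0):
    random.seed(seed)
    medoids = random.sample(points, k)      # step 1: k random medoids
    best = total_cost(points, medoids)      # steps 2-3: assignment and cost
    improved = True
    while improved:
        improved = False
        for i in range(k):                  # step 4: try swapping each medoid
            for p in points:
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                cost = total_cost(points, candidate)
                if cost < best:             # step 5: keep cost-reducing swaps
                    medoids, best = candidate, cost
                    improved = True
    return medoids, best

# The ten points from the walkthrough below.
data = [(9, 3), (8, 4), (4, 6), (8, 5), (2, 5),
        (3, 8), (5, 8), (4, 4), (10, 4), (9, 6)]
print(k_medoids(data, k=2))  # two medoids and the final cost
```

This naive version re-evaluates every possible swap on each pass; library implementations of PAM use smarter bookkeeping to avoid recomputing the full cost each time.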
Let's look at the following example demonstrating the k-medoids algorithm.
Let's take the data given below. We want to divide the data into clusters with k = 2.
| No. | X | Y |
| --- | --- | --- |
| 1 | 9 | 3 |
| 2 | 8 | 4 |
| 3 | 4 | 6 |
| 4 | 8 | 5 |
| 5 | 2 | 5 |
| 6 | 3 | 8 |
| 7 | 5 | 8 |
| 8 | 4 | 4 |
| 9 | 10 | 4 |
| 10 | 9 | 6 |
When we plot the data, it looks like this:
Pick two random medoids. In this case, we pick m1 = (4, 6) and m2 = (9, 3).
We calculate the distance of each data point from both medoids using the Manhattan distance formula, d = |x1 - x2| + |y1 - y2|:
| No. | X | Y | Distance from medoid m1 (4, 6) | Distance from medoid m2 (9, 3) |
| --- | --- | --- | --- | --- |
| 1 | 9 | 3 | - | - |
| 2 | 8 | 4 | 6 | 2 |
| 3 | 4 | 6 | - | - |
| 4 | 8 | 5 | 5 | 3 |
| 5 | 2 | 5 | 3 | 9 |
| 6 | 3 | 8 | 3 | 11 |
| 7 | 5 | 8 | 3 | 9 |
| 8 | 4 | 4 | 2 | 6 |
| 9 | 10 | 4 | 8 | 2 |
| 10 | 9 | 6 | 5 | 3 |
Each point is assigned to the cluster of the medoid with the least dissimilarity (points 1 and 3 show no distances because they are the medoids m2 and m1 themselves). For example, point 2 (8, 4) lies at distance |8 - 4| + |4 - 6| = 6 from m1 and |8 - 9| + |4 - 3| = 2 from m2, so it joins m2's cluster. Here, points 5, 6, 7, and 8 are assigned to medoid m1 (4, 6), and points 2, 4, 9, and 10 are assigned to medoid m2 (9, 3). The total cost is (3 + 3 + 3 + 2) + (2 + 3 + 2 + 3) = 21.
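As an aside, you can reproduce this distance table programmatically; here is a short sketch using SciPy's cdist with the "cityblock" (Manhattan) metric, assuming NumPy and SciPy are installed:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[9, 3], [8, 4], [4, 6], [8, 5], [2, 5],
                   [3, 8], [5, 8], [4, 4], [10, 4], [9, 6]])
medoids = np.array([[4, 6], [9, 3]])  # m1 and m2

# One row per point, one column per medoid; 'cityblock' is Manhattan distance.
print(cdist(points, medoids, metric="cityblock"))
```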
Now, randomly select another medoid and recalculate the distances. The chosen new medoid is (2, 5) (point 5), which replaces m1 (4, 6):
| No. | X | Y | Distance from medoid m1 (2, 5) | Distance from medoid m2 (9, 3) |
| --- | --- | --- | --- | --- |
| 1 | 9 | 3 | - | - |
| 2 | 8 | 4 | 7 | 2 |
| 3 | 4 | 6 | 3 | 8 |
| 4 | 8 | 5 | 6 | 3 |
| 5 | 2 | 5 | - | - |
| 6 | 3 | 8 | 4 | 11 |
| 7 | 5 | 8 | 6 | 9 |
| 8 | 4 | 4 | 3 | 6 |
| 9 | 10 | 4 | 9 | 2 |
| 10 | 9 | 6 | 8 | 3 |
Here, points 3, 6, 7, and 8 are assigned to the new medoid m1 (2, 5), while points 2, 4, 9, and 10 remain with medoid m2 (9, 3). The new total cost is (3 + 4 + 6 + 3) + (2 + 3 + 2 + 3) = 26, so the swap cost is 26 - 21 = 5. Since the swap cost is greater than 0, the swap made the clustering worse, and we undo it; the algorithm converges.
The final medoids are m1 (4, 6) and m2 (9, 3), giving the clusters {3, 5, 6, 7, 8} and {1, 2, 4, 9, 10}.
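To double-check the arithmetic in this walkthrough, here is a short throwaway script (our own, not part of any library) that recomputes both costs:

```python
data = [(9, 3), (8, 4), (4, 6), (8, 5), (2, 5),
        (3, 8), (5, 8), (4, 4), (10, 4), (9, 6)]

def cost(medoids):
    # Sum of Manhattan distances from each point to its nearest medoid.
    return sum(min(abs(x - mx) + abs(y - my) for mx, my in medoids)
               for x, y in data)

print(cost([(4, 6), (9, 3)]))  # original medoids: 21
print(cost([(2, 5), (9, 3)]))  # after the trial swap: 26, a swap cost of 5
```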
scikit-learn-extra is an extension of scikit-learn that implements advanced algorithms, including k-medoids. Install scikit-learn-extra and import the necessary libraries:
```python
!pip install scikit-learn-extra

from sklearn_extra.cluster import KMedoids
import numpy as np
```
Initialize data points:
```python
data = np.asarray([[6, 9], [4, 10], [4, 4], [8, 5], [8, 3],
                   [5, 2], [5, 8], [6, 4], [4, 8], [3, 9]])
```
We can utilize the KMedoids algorithm from scikit-learn-extra to fit the provided data:
```python
k = 2
kmedoids = KMedoids(n_clusters=k).fit(data)
```
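Note that KMedoids defaults to the Euclidean metric; to reproduce the Manhattan distances used in the walkthrough above, you can pass metric="manhattan" when constructing the estimator.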
Print the final clusters and their medoids:
```python
# medoid_indices_ gives the positions of the final medoids in the data;
# labels_ gives each point's cluster assignment.
medoids_final = data[kmedoids.medoid_indices_]
for i in range(k):
    print(f"Cluster {i}:", data[kmedoids.labels_ == i])
print(medoids_final)
```
The final clusters will be output:
Cluster 0:
Cluster 1:
The final medoids will be
The k-medoids algorithm is quick and converges in a finite number of steps. It is simple to implement and understand, and it addresses some of the limitations of the k-means algorithm.
The k-medoids algorithm is neither optimal nor well suited for clustering non-spherical (arbitrarily shaped) groups of objects, because it relies heavily on minimizing the distances between non-medoid points and their medoids. Another disadvantage is that, since the first medoids are selected at random, different runs of the algorithm can produce different final clusters.
K-means is one of the most powerful clustering algorithms, but it has limitations, including a sensitivity to outliers that can distort the clusters it forms. K-medoids, on the other hand, is a powerful clustering tool with several benefits over other clustering algorithms, and its ability to group data points by similarity makes it a valuable asset for pattern recognition and data analysis.