What is the k-medoids algorithm?

K-medoids

K-medoids, also known as partitioning around medoids (PAM), is a popular clustering algorithm that groups data points into k clusters by selecting k representative objects (medoids) within a dataset. Clustering is a robust unsupervised machine-learning technique that establishes patterns by identifying groups of data points with similar characteristics within a dataset.

The k-means clustering algorithm uses centroids. K-medoids is an alternative clustering algorithm that uses medoids instead.

Medoids

A medoid can be defined as the point in a cluster from which the sum of distances to all other points in the cluster is minimal. It is the data point in a cluster with the lowest overall dissimilarity to the other data points.

The k-means algorithm is sensitive to outliers. The k-medoids algorithm, on the other hand, mitigates that sensitivity by eliminating the reliance on centroids. The k-medoids algorithm aims to group data points into k clusters, where each data point is assigned to a medoid and the sum of distances between data points and their assigned medoids is minimized. The algorithm iteratively assigns each data point to the closest medoid and swaps the medoid of each cluster until convergence.

Medoid vs mean in a 2D space

Manhattan distance

The distance between each data point and each medoid is calculated using the Manhattan distance formula; this distance is also known as the cost. For two points (x1, y1) and (x2, y2), the Manhattan distance is |x1 - x2| + |y1 - y2|.
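
A minimal sketch of this in Python (the helper name manhattan_distance is our own, for illustration):

def manhattan_distance(p, q):
    """Sum of the absolute coordinate differences between two points."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Distance between the data point (8, 4) and the medoid (4, 6):
print(manhattan_distance((8, 4), (4, 6)))  # |8 - 4| + |4 - 6| = 6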

How the k-medoids algorithm works

  1. Select k random points from the dataset. First, the algorithm selects k random points from the dataset as the initial medoids. These chosen medoids define the initial k clusters.

  2. Assign data points to the cluster of the nearest medoid. It then assigns each non-medoid point to the cluster of the closest medoid.

  3. Calculate the total sum of distances of the data points from their assigned medoids. It then calculates the cost, i.e., the total sum of distances (dissimilarities) between the data points and their assigned medoids.

  4. Swap a non-medoid point with a medoid point and recalculate the cost. It then swaps a medoid with one of the non-medoid points and recalculates the total sum of distances.

  5. Undo the swap if the recalculated cost with the new medoid exceeds the previous cost. Finally, it evaluates whether the recalculated cost is greater than the previous cost. If so, it undoes the swap, and the algorithm converges. If the recalculated cost is less, it repeats step 4.

After the algorithm completes, we will have k medoid points with their clusters, as sketched below.
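
The whole procedure can be summarized in a short, self-contained Python sketch. This illustrates the swap-based search described above and is not a reference implementation; the names manhattan, total_cost, and pam are our own:

import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids):
    # Each point contributes its distance to the nearest medoid.
    return sum(min(manhattan(p, m) for m in medoids) for p in points)

def pam(points, k):
    medoids = random.sample(points, k)        # step 1: k random initial medoids
    cost = total_cost(points, medoids)        # steps 2-3: assignment and cost
    improved = True
    while improved:                           # repeat until no swap lowers the cost
        improved = False
        for i in range(k):
            for p in points:                  # step 4: try swapping medoid i with p
                if p in medoids:
                    continue
                candidate = medoids[:i] + [p] + medoids[i + 1:]
                new_cost = total_cost(points, candidate)
                if new_cost < cost:           # step 5: keep the swap only if cost drops
                    medoids, cost = candidate, new_cost
                    improved = True
    return medoids, cost

points = [(9, 3), (8, 4), (4, 6), (8, 5), (2, 5),
          (3, 8), (5, 8), (4, 4), (10, 4), (9, 6)]
print(pam(points, k=2))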

Let's look at the following example, which demonstrates the k-medoids algorithm.

Example


Let's take the data given below. We want to divide the data into clusters with k = 2.

Data points

No.    X     Y
1      9     3
2      8     4
3      4     6
4      8     5
5      2     5
6      3     8
7      5     8
8      4     4
9      10    4
10     9     6


When we plot the data, it looks like this:

Graph showing data points before clustering

Pick two random medoids. In this case, we pick (4, 6) and (9, 3).
We then calculate the distance of each data point from both medoids using the Manhattan distance formula:

Distance from randomly selected medoids

No.    X     Y     Distance from medoid m1 (4, 6)    Distance from medoid m2 (9, 3)
1      9     3     -                                 -
2      8     4     6                                 2
3      4     6     -                                 -
4      8     5     5                                 3
5      2     5     3                                 9
6      3     8     3                                 11
7      5     8     3                                 9
8      4     4     2                                 6
9      10    4     8                                 2
10     9     6     5                                 3

(A dash indicates that the point is itself a medoid.)

All the points are assigned to the cluster of the medoid with the least dissimilarity. Here, points 5, 6, 7, and 8 are assigned to m1, while points 2, 4, 9, and 10 are assigned to m2. The total cost is (3 + 3 + 3 + 2) + (2 + 3 + 2 + 3) = 21.
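
These table values and the total cost can be verified with a few lines of NumPy (a sketch; the helper name pam_cost is our own):

import numpy as np

# The ten example points from the table above.
points = np.array([[9, 3], [8, 4], [4, 6], [8, 5], [2, 5],
                   [3, 8], [5, 8], [4, 4], [10, 4], [9, 6]])

def pam_cost(medoids):
    """Total Manhattan distance from each point to its nearest medoid."""
    d = np.abs(points[:, None, :] - np.asarray(medoids)[None, :, :]).sum(axis=2)
    return d.min(axis=1).sum()

print(pam_cost([[4, 6], [9, 3]]))  # 21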


Now, randomly select another medoid and recalculate the distances. The chosen new medoid is (2, 5). Swap it with (4, 6), and the distances are calculated as follows:

Distance from new medoid

No.    X     Y     Distance from medoid m1 (2, 5)    Distance from medoid m2 (9, 3)
1      9     3     -                                 -
2      8     4     7                                 2
3      4     6     3                                 8
4      8     5     6                                 3
5      2     5     -                                 -
6      3     8     4                                 11
7      5     8     6                                 9
8      4     4     3                                 6
9      10    4     9                                 2
10     9     6     8                                 3

Here, points 3, 6, 7, and 8 are assigned to m1, while points 2, 4, 9, and 10 are assigned to m2. The recalculated cost is (3 + 4 + 6 + 3) + (2 + 3 + 2 + 3) = 26.


Swap Cost = Recalculated Cost - Previous Cost
Swap Cost = 26 - 21 = 5
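
Reusing the pam_cost helper from the NumPy sketch above, the swap can be evaluated directly:

previous_cost = pam_cost([[4, 6], [9, 3]])      # 21
recalculated_cost = pam_cost([[2, 5], [9, 3]])  # 26
print(recalculated_cost - previous_cost)        # swap cost: 5 > 0, so undo the swap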


Since the swap cost is 5, which is greater than 0, we undo the swap.
The final medoids are m1 (4, 6) and m2 (9, 3), and the two clusters are formed using these medoids. The green dots form the cluster with medoid m1, and the blue dots form the cluster with medoid m2.

Graph showing data points after clustering

Python implementation

  • scikit-learn-extra is an extension of scikit-learn that implements additional algorithms, including k-medoids. Install it and import the necessary libraries:

!pip install scikit-learn-extra
from sklearn_extra.cluster import KMedoids
import numpy as np
  • Initialize data points:

data = np.asarray([[6, 9], [4, 10], [4, 4], [8, 5], [8, 3], [5, 2], [5, 8], [6, 4], [4, 8], [3, 9]])
  • We can use the KMedoids class from scikit-learn-extra to fit the provided data:

k = 2
kmedoids = KMedoids(n_clusters=k).fit(data)
  • Print the final clusters and their medoids:

labels = kmedoids.labels_
medoids_final = data[kmedoids.medoid_indices_]
for i in range(k):
    print(f"Cluster {i}:")
    print(data[labels == i])
print(medoids_final)
  • The final clusters will be output:

Cluster 0:
[[6, 9], [8, 5], [8, 3], [6, 4], [3, 9]]

Cluster 1:
[[4, 10], [4, 4], [5, 2], [5, 8], [4, 8]]

  • The final medoids will be (6, 9) and (4, 4).

Advantages

The k-medoids algorithm is fast and converges in a finite number of steps. It is simple to implement and easy to understand, and it addresses some of the limitations of the k-means algorithm, such as sensitivity to outliers.

Disadvantages

The k-medoids algorithm is neither optimal nor well suited to clustering non-spherical (arbitrarily shaped) groups of objects, because it relies heavily on minimizing the distances between the non-medoid points and their medoids. Another disadvantage is that, since the first k medoids are chosen randomly, the outcome may differ across runs on the same dataset.
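
One way to mitigate this run-to-run variability is to fix the random seed; for example, scikit-learn-extra's KMedoids accepts a random_state parameter. A usage sketch, with init="random" chosen here to make the randomized initialization explicit:

import numpy as np
from sklearn_extra.cluster import KMedoids

data = np.asarray([[6, 9], [4, 10], [4, 4], [8, 5], [8, 3],
                   [5, 2], [5, 8], [6, 4], [4, 8], [3, 9]])

# With a randomized initialization, fixing random_state makes repeated
# runs on the same dataset return the same medoids.
kmedoids = KMedoids(n_clusters=2, init="random", random_state=0).fit(data)
print(data[kmedoids.medoid_indices_])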

Conclusion

K-means is one of the most powerful clustering algorithms. However, it has limitations, including a sensitivity to outliers that can distort the formed clusters. K-medoids, on the other hand, is a powerful clustering tool with several benefits over other clustering algorithms. Its ability to group data by similarity makes it a valuable asset for pattern recognition and data analysis.

