K-Means Walk-Through Example
Practice the k-means algorithm with a step-by-step walkthrough example in this lesson.
k-means algorithm
For a given dataset and value of k, k-means clustering has the following steps:
1. Choose some value of k (the number of clusters), if it's not given already.
2. Choose k centroids randomly.
3. Find the similarity score of each data point with respect to each centroid.
4. Based on the similarity score, assign each data point to its nearest centroid.
5. From these new groupings, find new centroids by taking the mean of all the data points in a cluster.
6. Repeat steps 3 to 5 until the difference between the old and new centroids is negligible.
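The steps above can be sketched in a few lines of numpy. This is a minimal illustration, not the lesson's widget code; the function name `kmeans` and the choice of initializing centroids from random data points are our own assumptions:

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Plain k-means sketch: points is an (n, 2) array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k random data points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Steps 3-4: Euclidean distance to each centroid; assign the nearest one.
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 5: new centroid = mean of the points in each cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([
            points[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Step 6: stop when the centroids barely move.
        if np.linalg.norm(new_centroids - centroids) < tol:
            break
        centroids = new_centroids
    return centroids, labels
```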
If the steps above seem unclear, don’t worry. We’re going to show each step in an example with an illustration.
Dry running the example
Let’s say we have the following dataset:
Step 1: Plotting the data
Run the following code widget in order to plot the data. Here, x and y contain all the x and y-coordinates to represent our synthetic data.
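If the widget isn't available, a minimal sketch of that plotting step might look like the following, assuming matplotlib; the `x` and `y` values are the 15 synthetic points read off the distance table later in this lesson:

```python
import matplotlib.pyplot as plt

# The 15 synthetic data points (taken from the dissimilarity table below).
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]

plt.scatter(x, y)
plt.xlabel("x")
plt.ylabel("y")
plt.title("Synthetic data")
plt.show()
```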
Let’s start with the first step of k-means clustering and decide how many clusters we want, if the number isn’t given already. Let the number of clusters be three, which means k = 3.
Step 2: Assigning values to centroids
The second step is to initialize k centroids with random values. Since k is 3, we’ll get three centroids: C1, C2, and C3. Assigning them random values yields:
In the following code, Cx and Cy represent x and y coordinates of the centroids:
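As a sketch of that step, the centroid coordinates from the table below can be stored in `Cx` and `Cy` like this (the variable `centroids` is our own addition for convenience):

```python
# Initial centroids C1 = (1, 1), C2 = (7, 2), C3 = (5, 6.5).
Cx = [1, 7, 5]
Cy = [1, 2, 6.5]

# Pair up the coordinates into (x, y) tuples.
centroids = list(zip(Cx, Cy))
print(centroids)  # [(1, 1), (7, 2), (5, 6.5)]
```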
Step 3: Calculating the dissimilarity score
The third step is to find the dissimilarity score of each data point (15 in total) with respect to each centroid. We’ll use the Euclidean distance as the dissimilarity score: the smaller the distance, the more similar a point is to a centroid. The function euclidean_distances takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using sklearn:
Here is the explanation for the code above:
- Lines 3–7: We define two lists, `x` and `y`, representing the x and y-coordinates of the data points. Similarly, `Cx` and `Cy` represent the x and y-coordinates of the initial centroids.
- Line 10: We convert the `x` and `y` lists into an array of arrays representing `data_points` using a list comprehension.
- Line 13: We use the `euclidean_distances` function to calculate the Euclidean distances between the data points and the centroids and print the resulting array.
The code output will be a 2D array where each row represents a data point, and each column represents a centroid. The value at position (i, j) in the array represents the Euclidean distance between the i-th data point and the j-th centroid.
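The computation described above might look like the following sketch (the coordinate lists match the table below, so the printed distances can be checked against it, e.g. the point (1, 2) is at distance 6 from the centroid (7, 2)):

```python
from sklearn.metrics.pairwise import euclidean_distances

# Data points (x, y) and initial centroids (Cx, Cy).
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
Cx = [1, 7, 5]
Cy = [1, 2, 6.5]

# Convert the coordinate lists into arrays of [x, y] points.
data_points = [[xi, yi] for xi, yi in zip(x, y)]
centroids = [[cx, cy] for cx, cy in zip(Cx, Cy)]

# Row i, column j: distance between data point i and centroid j.
distances = euclidean_distances(data_points, centroids)
print(distances.shape)  # (15, 3)
print(distances[0])     # distances of (1, 2) to the three centroids
```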
The dissimilarity scores were calculated using sklearn and are also given below:
Dissimilarity Scores
| Data Point | Centroid_1 (1, 1) | Centroid_2 (7, 2) | Centroid_3 (5, 6.5) |
| --- | --- | --- | --- |
| (1, 2) | 1 | 6 | 6.020797289 |
| (2, 1) | 1 | 5.099019514 | 6.264982043 |
| (2, 1.5) | 1.118033989 | 5.024937811 | 5.830951895 |
| (2.5, 3.5) | 2.915475947 | 4.74341649 | 3.905124838 |
| (3, 4) | 3.605551275 | 4.472135955 | 3.201562119 |
| (4, 3.5) | 3.905124838 | 3.354101966 | 3.16227766 |
| (4, 7.5) | 7.158910532 | 6.264982043 | 1.414213562 |
| (5, 6) | 6.403124237 | 4.472135955 | 0.5 |
| (5, 7) | 7.211102551 | 5.385164807 | 0.5 |
| (5.5, 2) | 4.609772229 | 1.5 | 4.527692569 |
| (6, 1.5) | 5.024937811 | 1.118033989 | 5.099019514 |
| (6, 3) | 5.385164807 | 1.414213562 | 3.640054945 |
| (6, 5.5) | 6.726812024 | 3.640054945 | 1.414213562 |
| (6.5, 5) | 6.800735254 | 3.041381265 | 2.121320344 |
| (7, 2.5) | 6.184658438 | 0.5 | 4.472135955 |
Now, don’t panic at the sight of this giant table. All it tells us is the distance of each data point from each centroid. For example, let’s look at the first data point ...
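Reading the table row by row and picking the smallest distance is exactly step 4 of the algorithm. With the distance matrix from the sklearn code above, that assignment is a one-liner; here is a sketch using numpy's `argmin` (the coordinate lists are the ones from the table):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
data_points = np.array(list(zip(x, y)))
centroids = np.array([[1, 1], [7, 2], [5, 6.5]])

distances = euclidean_distances(data_points, centroids)
# Each point gets the index (0, 1, or 2) of its closest centroid.
labels = distances.argmin(axis=1)
print(labels)
```

Each label is the column with the smallest value in that point's row of the table, so the first four points go to Centroid_1, and so on.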