K-Means Walk-Through Example
Practice the k-means algorithm with a step-by-step walkthrough example in this lesson.
The k-means algorithm
For a given dataset and a value of k, k-means clustering has the following steps (a compact code sketch follows the list):
- Choose some value of k such that k ≤ n, where n is the number of data points, if it’s not given already.
- Choose k centroids randomly.
- Find the dissimilarity score of each data point with respect to each centroid.
- Based on the dissimilarity score, assign each data point to its nearest centroid.
- From these new groupings, find new centroids by taking the mean of all data points in a cluster.
- Repeat steps 3 to 5 until the difference between the old and new centroids is negligible.
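For reference, here’s a minimal sketch of the whole loop in plain Python, assuming each data point is a 2D [x, y] pair. The helper names `kmeans` and `euclidean` are our own, not part of this lesson:

```python
import random

def euclidean(p, q):
    # Straight-line distance between two 2D points
    return ((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2) ** 0.5

def kmeans(points, k, tol=1e-6):
    # Step 2: choose k centroids randomly (here, k distinct data points)
    centroids = random.sample(points, k)
    while True:
        # Steps 3-4: assign each point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 5: recompute each centroid as the mean of its cluster
        new_centroids = [
            [sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c)]
            if c else centroids[i]  # keep the old centroid if a cluster is empty
            for i, c in enumerate(clusters)
        ]
        # Step 6: stop when the centroids barely move
        if all(euclidean(a, b) < tol for a, b in zip(centroids, new_centroids)):
            return new_centroids, clusters
        centroids = new_centroids
```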
If the steps above seem unclear, don’t worry. We’re going to show each step in an example with an illustration.
Dry running the example
Let’s say we have a small two-dimensional dataset, given by the x and y-coordinates in the code below.
Step 1: Plotting the data
Run the following code widget to plot the data. Here, `x` and `y` contain all the x and y-coordinates representing our synthetic data.
```python
import matplotlib.pyplot as plt

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]

# Plotting the actual data points
plt.scatter(x, y, color='white', edgecolor='black')
plt.show()
```
Let’s start with the first step of k-means clustering and decide how many clusters we want, if the number isn’t given already. Let the number of clusters be three, which means k = 3.
Step 2: Assigning values to centroids
The second step is to assign the k centroids random values. Since k is 3, we’ll get three centroids, C1, C2, and C3, each placed at a random position. In the following code, `Cx` and `Cy` represent the x and y-coordinates of the centroids:
```python
import matplotlib.pyplot as plt

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]

# Assigning random positions to centroids
Cx = [1, 7, 5]
Cy = [1, 2, 6.5]
colors = ['red', 'blue', 'green']

# Plotting the actual data points
plt.scatter(x, y, color='white', edgecolor='black')

# Plotting the centroids
for ctr, clr in zip(range(len(Cx)), colors):
    plt.plot(Cx[ctr], Cy[ctr], color=clr, marker='s', markersize=10, alpha=0.2)

plt.show()
```
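For the walkthrough, the centroid positions above are fixed so the results stay reproducible. If you’d rather pick them randomly, as step 2 of the algorithm suggests, one common option is to sample k of the data points. Here’s a small sketch (our own code, not part of the lesson’s widget):

```python
import random

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]

k = 3
# Pick k distinct data points to serve as the initial centroids
indices = random.sample(range(len(x)), k)
Cx = [x[i] for i in indices]
Cy = [y[i] for i in indices]
print(Cx, Cy)
```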
Step 3: Calculating the dissimilarity score
The third step is to find the dissimilarity score of each data point (15 in total) with respect to each centroid. We’ll use the Euclidean distance as the dissimilarity score. The function `euclidean_distances` takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using `sklearn`:
```python
from sklearn.metrics.pairwise import euclidean_distances as dis_score

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]

# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5]

# Converting data points and centers into arrays of arrays for using dis_score
data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]

print(dis_score(data_points, centers))
```
Here is the explanation for the code above:
- Lines 3–7: We define two lists, `x` and `y`, representing the x and y-coordinates of the data points. Similarly, `Cx` and `Cy` represent the x and y-coordinates of the initial centroids.
- Lines 10–11: We convert the `x` and `y` lists into an array of arrays, `data_points`, using a list comprehension, and do the same for the centroids in `centers`.
- Line 13: We use the `euclidean_distances` function to calculate the Euclidean distances between the data points and the centroids and print the resulting array.
The code output will be a 2D array where each row represents a data point and each column represents a centroid. The value at position (i, j) in the array represents the Euclidean distance between the i-th data point and the j-th centroid. For example, the first entry is the distance between the first data point (1, 2) and the first centroid (1, 1), which is 1.0.
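With this distance matrix in hand, step 4 of the algorithm reduces to finding, for each row, the column with the smallest value. Here’s a short sketch of how that could look with NumPy’s `argmin` (our addition; the lesson may continue the walkthrough differently):

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances as dis_score

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
Cx = [1, 7, 5]; Cy = [1, 2, 6.5]

data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]

# Each row holds one data point's distances to the three centroids
distances = dis_score(data_points, centers)

# Step 4: assign each data point to its nearest centroid (smallest distance)
labels = np.argmin(distances, axis=1)
print(labels)  # e.g., 0 means the point belongs to the first centroid
```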