K-Means Walk-Through Example

Practice the k-means algorithm with a step-by-step walkthrough example in this lesson.

K-means algorithm

For a given dataset and value of $k$, $k$-means clustering has the following steps:

  1. Choose some value of $k$ such that $k \ge 2$, if it’s not given already.

  2. Choose $k$ centroids at random.

  3. Find the dissimilarity score (distance) between each data point and each centroid.

  4. Based on that score, assign each data point to its nearest centroid.

  5. From these new groupings, find new centroids by taking the mean of all data points of a cluster.

  6. Repeat steps 3 to 5 until the difference between old and new centroids is negligible.

If the steps above seem unclear, don’t worry. We’re going to show each step in an example with an illustration.
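As a compact preview, the steps listed above can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters (`max_iter`, `tol`, `seed`), the choice to pick the initial centroids from among the data points, and the toy data at the bottom are all assumptions made for this sketch.

```python
import numpy as np

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means sketch; names and defaults are illustrative only."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Step 2: pick k distinct data points as the initial centroids (an assumption)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: Euclidean distance from every point to every centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        # Step 4: assign each point to its nearest centroid
        labels = dists.argmin(axis=1)
        # Step 5: new centroid = mean of the points in each cluster
        # (an empty cluster simply keeps its old centroid)
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 6: stop once the centroids barely move
        moved = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if moved < tol:
            break
    return centroids, labels

# Toy data made up for this sketch: two well-separated pairs of points
toy_points = [[0, 0], [0, 1], [10, 10], [10, 11]]
toy_centroids, toy_labels = kmeans(toy_points, k=2)
print(toy_centroids)
print(toy_labels)
```

The order of the returned centroids depends on the random initialization, so cluster labels may be swapped between runs with different seeds.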

Dry running the example

Let’s say we have the following dataset:

Figure: Actual dataset

Step 1: Plotting the data

Run the following code to plot the data. Here, x and y hold the x- and y-coordinates of our synthetic data points.

import matplotlib.pyplot as plt
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Plotting the actual datapoints
plt.scatter(x, y, color = 'white', edgecolor = 'black')

Let’s start with the first step of $k$-means clustering and decide how many clusters we want if the number isn’t given already. Let the number of clusters be three, which means $k = 3$.

Step 2: Assigning values to centroids

The second step is to choose $k$ centroids and give each one a random position. Since $k$ is $3$, we get three centroids $\boldsymbol{\mu}_1$, $\boldsymbol{\mu}_2$, and $\boldsymbol{\mu}_3$. Assigning them random values yields:

Figure: Centroids assignment

In the following code, Cx and Cy represent x and y coordinates of the centroids:

import matplotlib.pyplot as plt
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
colors = ['red', 'blue', 'green']
# Plotting the actual datapoints
plt.scatter(x, y, color = 'white', edgecolor = 'black')
# Plotting centroids
for ctr, clr in zip(range(len(Cx)), colors):
    plt.plot(Cx[ctr], Cy[ctr], color = clr, marker = 's', markersize=10, alpha = 0.2)
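The centroid positions above are fixed by hand, which keeps the walkthrough reproducible. As a sketch of how they could instead be drawn at random, the snippet below samples each coordinate uniformly within the range of the data. The names `Cx_random` and `Cy_random`, the seed value, and the choice of a uniform distribution over the bounding box are assumptions for illustration, not part of the lesson.

```python
import numpy as np

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
k = 3

# Arbitrary seed, used only so that repeated runs give the same centroids
rng = np.random.default_rng(42)

# Draw k centroids uniformly inside the bounding box of the data
Cx_random = rng.uniform(min(x), max(x), size=k)
Cy_random = rng.uniform(min(y), max(y), size=k)

# One row per centroid: [x-coordinate, y-coordinate]
print(np.column_stack([Cx_random, Cy_random]))
```

Different seeds give different starting centroids, and k-means can converge to different clusterings depending on where it starts.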

Step 3: Calculating the dissimilarity score

The third step is to find the dissimilarity score of each data point (15 total) with each centroid. We’ll be using the Euclidean distance as the dissimilarity score. The function euclidean_distances takes two arrays, where each array is an array of points. Let’s see how to calculate the dissimilarity score using sklearn:

from sklearn.metrics.pairwise import euclidean_distances as dis_score
x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
# Assigning random positions to centroids
Cx = [1, 7, 5]; Cy = [1, 2, 6.5];
# Converting data points and centers into lists of [x, y] pairs for use with dis_score
data_points = [[x[i], y[i]] for i in range(len(x))]
centers = [[Cx[i], Cy[i]] for i in range(len(Cx))]
print(dis_score(data_points, centers))

Here is the explanation for the code above:

  • Lines 2–5: We define two lists x and y representing the x and y-coordinates of the data points. Similarly, Cx and Cy represent the x and y-coordinates of the initial centroids.

  • Lines 7–8: We convert the x and y lists into a list of [x, y] pairs called data_points using a list comprehension, and build centers from Cx and Cy in the same way.

  • Line 9: We use the euclidean_distances function (imported as dis_score) to calculate the Euclidean distances between data points and centroids and print the resulting array.
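To see what the distance computation does, recall that the Euclidean distance between a point $(x, y)$ and a centroid $(c_x, c_y)$ is $\sqrt{(x - c_x)^2 + (y - c_y)^2}$. For the first data point $(1, 2)$ and the first centroid $(1, 1)$, this gives $\sqrt{0^2 + 1^2} = 1$. The sketch below computes the same table of distances by hand with NumPy broadcasting. The variable names `diff` and `manual_dis_score` are made up for this illustration, and the result should agree with the sklearn output up to floating-point rounding.

```python
import numpy as np

x = [1, 2, 2, 2.5, 3, 4, 4, 5, 5, 5.5, 6, 6, 6, 6.5, 7]
y = [2, 1, 1.5, 3.5, 4, 3.5, 7.5, 6, 7, 2, 1.5, 3, 5.5, 5, 2.5]
Cx = [1, 7, 5]
Cy = [1, 2, 6.5]

data_points = np.column_stack([x, y])  # shape (15, 2)
centers = np.column_stack([Cx, Cy])    # shape (3, 2)

# Broadcasting yields every point-minus-centroid difference: shape (15, 3, 2)
diff = data_points[:, None, :] - centers[None, :, :]

# Square, sum over the coordinate axis, then take the root: shape (15, 3)
manual_dis_score = np.sqrt((diff ** 2).sum(axis=2))
print(manual_dis_score)
```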

The code output will be a 2D array where each row represents a data point, and each column represents a centroid. The value at position $(i, j)$ in the array represents the Euclidean distance between the $i^{th}$ ...