Implementing the k-means algorithm from scratch

K-means clustering is a method that partitions n data points into k clusters such that each data point belongs to the cluster whose centroid is nearest to it. It’s a popular unsupervised machine learning algorithm that is widely used in fields such as data analysis, image processing, and natural language processing.

To recap, the basic step-by-step algorithm for k-means clustering is listed below:

  1. Place k centroids at random locations.

  2. Assign all the data points to the closest centroid.

  3. Compute the new centroids as the mean of all points in the cluster.

  4. Compute the sum of squared errors between the new and old centroids.

  5. Repeat steps 2 through 4 until this error becomes zero.


Step-by-step method

We can implement this algorithm in Python by following the series of steps outlined below.

Step 1: Visualize the dataset

To start off, we load a basic two-dimensional dataset from a CSV file into a pandas DataFrame. Next, we display the data points as a scatter plot, the preferred way to visualize two-dimensional data. The code to do this is shown below:

# importing the necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt

# loading the CSV dataset with the read_csv function
df = pd.read_csv('/mydataset.csv', header=None)
# the x-dimension of the dataset
x = df[0]
# the y-dimension of the dataset
y = df[1]
# displaying the data points as a scatter plot using matplotlib
plt.scatter(x, y, c='black', s=20)
plt.show()
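
If you don’t have a comparable CSV on hand, here is a minimal sketch for synthesizing a similar three-cluster dataset; the cluster centers, spread, and output filename are arbitrary assumptions:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed for reproducibility
centers = [(2, 3), (8, 8), (5, 1)]  # arbitrary cluster centers
# draw 100 points around each center with a small Gaussian spread
points = np.vstack([rng.normal(c, 0.8, size=(100, 2)) for c in centers])
pd.DataFrame(points).to_csv('mydataset.csv', header=False, index=False)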

Step 2: Generate initial centroid points

From the scatter plot in the previous step, we can see that the dataset naturally forms three clusters. Therefore, we choose the value of k to be three in this case.

Note: We can tune the value of k to suit the problem at hand.

Next, we generate initial centroids for the three clusters in our dataset via the random.uniform function:

# import the random module to generate random floating-point numbers
import random

# value of k for generating centroid points
k = 3
# take the max value of each dimension so the centroids fall
# within the range of the data points in the scatter plot
max_x = max(x)
max_y = max(y)
# x-coordinates of the random centroids
C_x = [random.uniform(0, max_x) for i in range(k)]
# y-coordinates of the random centroids
C_y = [random.uniform(0, max_y) for i in range(k)]
# store the random centroid points in the form of a 2D array and print them
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Initial Centroids:")
print(C)
# display the centroids on the scatter plot as blue stars for clarity
plt.scatter(x, y, c='black', s=20)
plt.scatter(C_x, C_y, marker='*', s=200, c='b')
plt.show()

Once the three random points are selected as centroids, we display them on the scatter plot we made in the previous step.
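
Since the initial centroids are random, every run starts from different positions. If you want reproducible runs while experimenting (an optional tweak, not part of the original walkthrough), fix the seed before generating them:

random.seed(0)  # any fixed integer makes random.uniform reproducible across runs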

Implement utility functions

Before we dive into the main logic of the k-means algorithm, we need to implement a few helper functions.

The first function we need implements the Euclidean distance metric, which we’ll use to find the distance between a centroid and a particular data point in the dataset. The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is given by the formula:

$$\sqrt{(y_2 - y_1)^2 + (x_2 - x_1)^2}$$

This is implemented as:

# implement the Euclidean distance formula in a function called "euclidean"
def euclidean(x, y):
    # subtract the coordinates, square the differences, sum them, and take the square root
    a = np.subtract(x, y)
    ans = np.square(a)
    r1 = np.sum(ans)
    return np.sqrt(r1)
The euclidean distance function
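
As a quick sanity check (not part of the original lesson), the classic 3-4-5 right triangle gives the distance we expect:

print(euclidean([0, 0], [3, 4]))  # prints 5.0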

Next, we implement the function that assigns each data point a label corresponding to the nearest of the three centroids. A small example after the code illustrates how it behaves.


The code that implements this function is shown below.

# function called "assign_members" that assigns labels to data points
def assign_members(points, centroids):
    c1 = []  # cluster 1, containing all points that belong to it
    c2 = []  # cluster 2, containing all points that belong to it
    c3 = []  # cluster 3, containing all points that belong to it
    X = points
    cluster_labels = []
    # initialize the centroid points with respect to clusters 1, 2, and 3
    c1pt = centroids[0]
    c2pt = centroids[1]
    c3pt = centroids[2]
    # find the Euclidean distance of the ith point to each of the three cluster centroids
    for i in range(len(X)):
        dist1 = euclidean(X[i], c1pt)
        dist2 = euclidean(X[i], c2pt)
        dist3 = euclidean(X[i], c3pt)
        # np.argmin returns the index of the smallest distance; that index is
        # appended to cluster_labels and the point is added to c1/c2/c3
        lab = np.argmin([dist1, dist2, dist3])
        # indices start from zero in an array, so labels start from zero!
        if lab == 0:    # label 0 corresponds to cluster 1
            c1.append(X[i])
            cluster_labels.append(0)
        elif lab == 1:  # label 1 corresponds to cluster 2
            c2.append(X[i])
            cluster_labels.append(1)
        else:           # label 2 corresponds to cluster 3
            c3.append(X[i])
            cluster_labels.append(2)
    return c1, c2, c3, cluster_labels
The assign_members function assigns each data point the label of its nearest centroid
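
Here’s a tiny illustrative run; the points and centroid values below are made up for demonstration:

pts = np.array([[1, 1], [1, 2], [9, 9], [5, 0]])
cents = np.array([[1, 1.5], [9, 9], [5, 0]])
c1, c2, c3, labels = assign_members(pts, cents)
print([int(l) for l in labels])  # [0, 0, 1, 2]: the first two points are nearest the first centroid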

Next, we create an update_centroids function that takes the mean of the data points assigned to each cluster and returns a new centroid point for each cluster.

def update_centroids(cluster1, cluster2, cluster3):
    # take the mean of all the points in each cluster to get its new centroid
    new_c1 = np.mean(cluster1, axis=0)
    new_c2 = np.mean(cluster2, axis=0)
    new_c3 = np.mean(cluster3, axis=0)
    return new_c1, new_c2, new_c3
Function for updating centroids
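
One caveat worth noting: if a cluster ends up with no points, np.mean of an empty list returns nan and emits a runtime warning, which would silently break the update. A defensive variant (a suggested extension, not part of the original lesson) can keep the old centroid in that case:

def update_centroid_safe(cluster, old_centroid):
    # fall back to the previous centroid when the cluster received no points
    return np.mean(cluster, axis=0) if len(cluster) > 0 else old_centroid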

Finally, we compute the error between the new centroids and the old centroids we initialized earlier. This is done via the computeError function, shown below.

# compute the sum of squared errors between the old centroids and the updated centroids:
# take the difference of the points, square the differences, and add them up
def computeError(old_centers, new_centers):
    ans = np.subtract(old_centers, new_centers)
    return np.sum(np.square(ans))
Function computing the squared sum error
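
For instance, with made-up centroid values where one centroid moves by 1 and another by 2 along the y-axis, the error comes out to $1^2 + 2^2 = 5$:

old = np.array([[1.0, 1.0], [4.0, 4.0], [7.0, 7.0]])
new = np.array([[1.0, 2.0], [4.0, 4.0], [7.0, 5.0]])
print(computeError(old, new))  # 5.0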

The formula for computing the sum of squares error is:

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

where $Y_i$ is the actual value, $\hat{Y}_i$ is the predicted value, and $n$ is the number of samples in the dataset. Let’s understand this formula with a small example:

Actual values (Y)    Predicted values (Ŷ)
120                  124
16                   13
200                  206

In the table of sample data above, we plug every pair of actual values ($Y$) and predicted values ($\hat{Y}$) into the formula to get:

$$(120-124)^2 + (16-13)^2 + (200-206)^2 = 16 + 9 + 36 = 61$$

We can deduce that the sum of squares error for the tabulated data is 61.
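
The same arithmetic in NumPy, for anyone who wants to verify it:

actual = np.array([120, 16, 200])
predicted = np.array([124, 13, 206])
print(np.sum(np.square(actual - predicted)))  # 61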

Step 3: Generate final results

Lastly, we run multiple iterations of the algorithm in a loop, performing the steps explained above. We stop executing the algorithm when the sum of squares error (the error variable) becomes zero. Run the code below to see the algorithm in action!

The helper functions are collected in functions.py:
import numpy as np

def computeError(old_centers, new_centers):
    ans = np.subtract(old_centers, new_centers)
    return np.sum(np.square(ans))

def assign_members(points, centroids):
    c1 = []
    c2 = []
    c3 = []
    X = points
    cluster_labels = []
    c1pt = centroids[0]
    c2pt = centroids[1]
    c3pt = centroids[2]
    for i in range(len(X)):
        dist1 = euclidean(X[i], c1pt)
        dist2 = euclidean(X[i], c2pt)
        dist3 = euclidean(X[i], c3pt)
        lab = np.argmin([dist1, dist2, dist3])
        if lab == 0:
            c1.append(X[i])
            cluster_labels.append(0)
        elif lab == 1:
            c2.append(X[i])
            cluster_labels.append(1)
        else:
            c3.append(X[i])
            cluster_labels.append(2)
    return c1, c2, c3, cluster_labels

def update_centroids(cluster1, cluster2, cluster3):
    new_c1 = np.mean(cluster1, axis=0)
    new_c2 = np.mean(cluster2, axis=0)
    new_c3 = np.mean(cluster3, axis=0)
    return new_c1, new_c2, new_c3

# implementing the Euclidean distance formula in a function called "euclidean"
def euclidean(x, y):
    a = np.subtract(x, y)
    ans = np.square(a)
    r1 = np.sum(ans)
    return np.sqrt(r1)
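
The contents of the main.py tab aren’t reproduced above, so here is a minimal sketch of what the driver loop might look like, stitched together from the earlier steps; the variable names (error, newcentroids) and the loop structure are assumptions consistent with the rest of the lesson:

# main.py (sketch): run k-means until the centroids stop moving
import random
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from functions import assign_members, update_centroids, computeError

# load the dataset and pack it into an (n, 2) array of points
df = pd.read_csv('/mydataset.csv', header=None)
X = np.array(list(zip(df[0], df[1])), dtype=np.float32)

# generate k random initial centroids within the data range (Step 2)
k = 3
C_x = [random.uniform(0, max(df[0])) for i in range(k)]
C_y = [random.uniform(0, max(df[1])) for i in range(k)]
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)

error = 1.0  # any nonzero value, just to enter the loop
while error != 0:
    # assign every point to its nearest centroid
    c1, c2, c3, cluster_labels = assign_members(X, C)
    # recompute each centroid as the mean of its cluster
    new_c1, new_c2, new_c3 = update_centroids(c1, c2, c3)
    newcentroids = np.array([new_c1, new_c2, new_c3], dtype=np.float32)
    # measure how far the centroids moved; zero means convergence
    error = computeError(C, newcentroids)
    C = newcentroids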

Once the algorithm stops executing, we display the scatter plot with the newly formed clusters and their updated centroids, as shown in the code below.

colors = ['r', 'g', 'b']
fig, ax = plt.subplots()
# plot the clusters corresponding to each cluster label (0, 1, and 2) one by one in a loop
for i in range(k):
    # the points carrying cluster label i are gathered into the points array and plotted in this iteration
    points = np.array([X[j] for j in range(len(X)) if cluster_labels[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
# the new centroids are shown on top of the newly formed clusters
ax.scatter(newcentroids[:, 0], newcentroids[:, 1], marker='*', s=200, c='black')
plt.show()

Conclusion

A key takeaway from implementing this algorithm is that it’s sensitive to the initial positions of the centroids: different initializations may result in different cluster assignments. Thus, it’s common to run the algorithm multiple times with various initial centroids and choose the best result, as sketched below.

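A minimal sketch of such a restart strategy follows; run_kmeans is a hypothetical wrapper around the driver loop above, and scoring runs by total point-to-centroid distance is one reasonable criterion, not the only one:

best_score = float('inf')
best_centroids, best_labels = None, None
for run in range(10):  # the number of restarts is arbitrary
    centroids, labels = run_kmeans(X, k)  # hypothetical wrapper around the loop above
    # score a run by the total distance from every point to its assigned centroid
    score = sum(euclidean(X[j], centroids[labels[j]]) for j in range(len(X)))
    if score < best_score:
        best_score = score
        best_centroids, best_labels = centroids, labels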

In addition, k-means can converge to a local minimum (a situation where the algorithm has found a solution that is the best within a small, local region of the solution space but may not necessarily be the globally optimal solution), so it might not always find the best clustering solution. Using alternative methods like hierarchical clustering or DBSCAN can mitigate this issue.
