K-means clustering is a method that partitions n data points into k clusters such that each data point belongs to the cluster whose centroid is nearest to it. It’s a popular unsupervised machine learning algorithm that is widely used in various fields, including data analysis, image processing, and natural language processing.
To recap, the basic step-by-step algorithm for k-means clustering is listed below:
Place k centroids at random locations.
Assign all the data points to the closest centroid.
Compute the new centroids as the mean of all points in the cluster.
Compute the sum of squared errors between the new and old centroids, and repeat steps 2-4 until this error becomes zero (a compact sketch of these steps follows this list).
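Before walking through the implementation, here is a compact, vectorized sketch of these steps for an arbitrary k using plain NumPy. The function name kmeans_sketch and its parameters are illustrative rather than part of the implementation developed in the rest of this article, and the sketch glosses over edge cases such as empty clusters.

#a compact sketch of the four k-means steps above, using plain NumPy
import numpy as np

def kmeans_sketch(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    #step 1: place k centroids at random locations within the data range
    centroids = rng.uniform(points.min(axis=0), points.max(axis=0),
                            size=(k, points.shape[1]))
    for _ in range(n_iters):
        #step 2: assign every point to its closest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        #step 3: recompute each centroid as the mean of its assigned points
        #(assumes no cluster ends up empty)
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        #step 4: stop once the squared error between old and new centroids is zero
        if np.sum((new_centroids - centroids) ** 2) == 0:
            break
        centroids = new_centroids
    return centroids, labels

#example usage on random data:
#centroids, labels = kmeans_sketch(np.random.rand(100, 2), k=3)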
Note: To read more about the k-means algorithm, check out this link.
We can implement this algorithm using Python by following a series of steps that are outlined below.
To start off, we load a basic two-dimensional dataset in the form of a CSV file via a pandas dataframe. Next, we display the data points in the form of a scatter plot, as it’s the preferred option to visualize two-dimensional data points. The code to do this is shown below:
#importing necessary libraries
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import math

#loading the csv dataset with the read_csv function
df=pd.read_csv('/mydataset.csv',header=None)
#the x-dimension of the dataset
x = df[0]
#the y-dimension of the dataset
y = df[1]
#displaying each dimension of points with a scatter plot using matplotlib
plt.scatter(x, y, c='black', s=20)
From the scatter plot in the previous step, we can see that the dataset forms three clusters. Therefore, we choose the value of k to be three in this case.
Note: We can tune the value of k to whatever best suits our problem.
Next, we generate initial centroids for the three clusters in our dataset via the random.uniform function:
#import the random module to generate random numbers of a float datatype
import random

#value of k for generating centroid points
k=3
#take the max values of each dimension to generate centroid values
#within the datapoint range in the scatter plot
max_x=max(x)
max_y=max(y)
# X coordinates of random centroids
C_x = [random.uniform(0,max_x) for i in range(k)]
# Y coordinates of random centroids
C_y = [random.uniform(0,max_y) for i in range(k)]
#print the random centroid points in the form of a 2d array
C = np.array(list(zip(C_x, C_y)), dtype=np.float32)
print("Initial Centroids:")
print(C)
#displaying the centroids onto the scatter plot with blue stars for clarity
plt.scatter(x, y, c='black', s=20)
plt.scatter(C_x, C_y, marker='*', s=200, c='b')
Once the three random points are selected as centroids, we display them on the scatter plot we made in the previous step.
Before we dive into the main logic of the solution to the k-means algorithm, we need to implement a few helper functions.
The first function we need implements the Euclidean distance metric, which is used to find the distance between a centroid and a particular data point in the dataset. The Euclidean distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ is $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$.
This is implemented as:
#implement the Euclidean distance formula in a function called "euclidean"
def euclidean(x, y):
    a=np.subtract(x,y)
    ans=np.square(a)
    r1=np.sum(ans)
    return np.sqrt(r1)
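As a quick sanity check (the values here are purely illustrative), the distance between the points (0, 0) and (3, 4) should come out to 5:

print(euclidean(np.array([0, 0]), np.array([3, 4])))  # prints 5.0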
Next, we implement the function that assigns each data point a label corresponding to the nearest of the three cluster centroids. A small example follows the code to make this clearer.
The code that implements this function is shown below.
#function called "assign_members" that assigns labels to data points
def assign_members(points, centroids):
    c1 = [] # cluster 1 containing all points that belong to it
    c2 = [] # cluster 2 containing all points that belong to it
    c3 = [] # cluster 3 containing all points that belong to it
    #initialize the centroid points with respect to clusters 1, 2, and 3
    X=points
    cluster_labels=[]
    c1pt=centroids[0]
    c2pt=centroids[1]
    c3pt=centroids[2]
    #find the Euclidean distance of the ith point to each of the three cluster centroids
    for i in range(len(X)):
        dist1=euclidean(X[i],c1pt)
        dist2=euclidean(X[i],c2pt)
        dist3=euclidean(X[i],c3pt)
        #the cluster with the smallest distance, found by np.argmin, is appended
        #to cluster_labels and that point is added to c1/c2/c3
        lab=np.argmin([dist1,dist2,dist3])
        #indices start from zero in an array, so labels start from zero!
        if lab==0: #label 0 corresponding to cluster 1
            c1.append(X[i])
            cluster_labels.append(0)
        elif lab==1: #label 1 corresponding to cluster 2
            c2.append(X[i])
            cluster_labels.append(1)
        else: #label 2 corresponding to cluster 3
            c3.append(X[i])
            cluster_labels.append(2)
    return c1,c2,c3,cluster_labels
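For instance, with three toy points and the centroids (0, 0), (5, 5), and (10, 10) (values chosen purely for illustration), each point ends up in the cluster whose centroid is closest to it:

pts = np.array([[1, 1], [6, 5], [9, 10]], dtype=np.float32)
cents = np.array([[0, 0], [5, 5], [10, 10]], dtype=np.float32)
c1, c2, c3, labels = assign_members(pts, cents)
print(labels)  # [0, 1, 2]: each point is closest to a different centroid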
Next, we create an update_centroids function that takes the mean of the data points assigned to each cluster and returns a new centroid point for each cluster.
def update_centroids(cluster1, cluster2, cluster3):
    #take the mean of all of the points to get the new centroid for c1, c2, c3 clusters
    new_c1=np.mean(cluster1,axis=0)
    new_c2=np.mean(cluster2,axis=0)
    new_c3=np.mean(cluster3,axis=0)
    return new_c1,new_c2,new_c3
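Continuing with small toy clusters (again, the values are purely illustrative), each new centroid is simply the coordinate-wise mean of the points assigned to it:

c1 = np.array([[1, 1], [3, 3]], dtype=np.float32)
c2 = np.array([[5, 5], [7, 5]], dtype=np.float32)
c3 = np.array([[9, 9], [11, 11]], dtype=np.float32)
print(update_centroids(c1, c2, c3))  # new centroids: (2, 2), (6, 5), (10, 10)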
Finally comes the part where we compute the error between the newly computed centroids and the old centroids from the previous step. This is done via the computeError function, which is shown below.
#compute the sum of squares error between the old centroids and the new centroids made after updating them
#first we take the difference of these points, then square the differences and add them up
def computeError(old_centers, new_centers):
    ans=np.subtract(old_centers,new_centers)
    return np.sum(np.square(ans))
The formula for computing the sum of squares error is:

$$SSE = \sum_{i=1}^{n} (Y_i - Y_i^*)^2$$

where $Y_i$ is an actual value, $Y_i^*$ is the corresponding predicted value, and $n$ is the number of values.
| Actual values (Y) | Predicted values (Y*) |
| --- | --- |
| 120 | 124 |
| 16 | 13 |
| 200 | 206 |
In the table with some sample data above, we plug every pair of actual values (Y) and predicted values (Y*) into the formula: $(120 - 124)^2 + (16 - 13)^2 + (200 - 206)^2 = 16 + 9 + 36$. We can deduce that the sum of squares error for the tabulated data is 61.
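As a quick check, plugging the tabulated values into the computeError function defined above reproduces the same result:

actual = np.array([120, 16, 200])
predicted = np.array([124, 13, 206])
print(computeError(actual, predicted))  # prints 61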
Lastly, we run multiple iterations of the algorithm in a loop, performing the steps explained above. We stop executing the algorithm when the sum of squares error (error) becomes zero. Run the code below to see the algorithm in action!
import numpy as np

def computeError(old_centers, new_centers):
    ans=np.subtract(old_centers,new_centers)
    return np.sum(np.square(ans))

def assign_members(points, centroids):
    c1 = []
    c2 = []
    c3 = []
    X=points
    cluster_labels=[]
    c1pt=centroids[0]
    c2pt=centroids[1]
    c3pt=centroids[2]
    for i in range(len(X)):
        dist1=euclidean(X[i],c1pt)
        dist2=euclidean(X[i],c2pt)
        dist3=euclidean(X[i],c3pt)
        lab=np.argmin([dist1,dist2,dist3])
        if lab==0:
            c1.append(X[i])
            cluster_labels.append(0)
        elif lab==1:
            c2.append(X[i])
            cluster_labels.append(1)
        else:
            c3.append(X[i])
            cluster_labels.append(2)
    return c1,c2,c3,cluster_labels

def update_centroids(cluster1, cluster2, cluster3):
    new_c1=np.mean(cluster1,axis=0)
    new_c2=np.mean(cluster2,axis=0)
    new_c3=np.mean(cluster3,axis=0)
    return new_c1,new_c2,new_c3

#implementing the euclidean distance formula called "euclidean"
def euclidean(x, y):
    a=np.subtract(x,y)
    ans=np.square(a)
    r1=np.sum(ans)
    return np.sqrt(r1)
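The helper functions above still need a driver loop to tie them together. Here is a minimal sketch of that loop, assuming the dataset columns x and y, the initial centroids C, and k from the earlier steps are in scope, and that no cluster ever ends up empty; the names X, cluster_labels, and newcentroids are chosen to match the plotting code that follows.

#a minimal sketch of the main k-means loop using the helpers defined above
#pack the two dataset columns into an (n, 2) array of points
X = np.array(list(zip(x, y)), dtype=np.float32)
#previous centroids, initialized to zero so the first error is non-zero
old_C = np.zeros(C.shape)
error = computeError(old_C, C)

while error != 0:
    #assign every point to its nearest centroid
    c1, c2, c3, cluster_labels = assign_members(X, C)
    #remember the current centroids before updating them
    old_C = np.copy(C)
    #recompute each centroid as the mean of its cluster
    new_c1, new_c2, new_c3 = update_centroids(c1, c2, c3)
    C = np.array([new_c1, new_c2, new_c3], dtype=np.float32)
    #measure how far the centroids moved
    error = computeError(old_C, C)

#final centroids, used in the plotting code below
newcentroids = C
print("Final Centroids:")
print(newcentroids)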
Once the algorithm stops executing, we display the scatter plot with the newly formed clusters and their updated centroids, as shown in the code below.
colors = ['r', 'g', 'b']
fig, ax = plt.subplots()
#plot the clusters corresponding to each cluster label (0, 1, and 2) one by one in a loop
for i in range(k):
    #points corresponding to a cluster label are gathered into the points array and plotted in each iteration
    points = np.array([X[j] for j in range(len(X)) if cluster_labels[j] == i])
    ax.scatter(points[:, 0], points[:, 1], s=7, c=colors[i])
#the new centroids are shown on the newly formed clusters
ax.scatter(newcentroids[:, 0], newcentroids[:, 1], marker='*', s=200, c='black')
A key takeaway from implementing this algorithm would be that it’s sensitive to the initial position of the centroids. Different initializations may result in different cluster assignments. Thus, it’s common to run the algorithm multiple times with various initial centroids and choose the best result, as shown below:
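Here is a minimal sketch of that restart strategy, reusing the helper functions defined earlier; it assumes X, k, max_x, max_y, and the random module from the previous steps are in scope, that the number of restarts is arbitrary, and that no cluster ever ends up empty.

#run k-means several times with different random initializations and keep
#the run with the lowest total within-cluster distance
best_score = float('inf')
best_labels, best_centroids = None, None

for run in range(5):  #the number of restarts is an arbitrary choice here
    #fresh random centroids for this run
    C = np.array([[random.uniform(0, max_x), random.uniform(0, max_y)]
                  for _ in range(k)], dtype=np.float32)
    old_C = np.zeros(C.shape)
    while computeError(old_C, C) != 0:
        c1, c2, c3, labels = assign_members(X, C)
        old_C = np.copy(C)
        C = np.array(update_centroids(c1, c2, c3), dtype=np.float32)
    #score the run by the sum of distances from each point to its assigned centroid
    score = sum(euclidean(X[i], C[labels[i]]) for i in range(len(X)))
    if score < best_score:
        best_score, best_labels, best_centroids = score, labels, C

print("Best total within-cluster distance:", best_score)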
In addition, k-means can converge to a local minimum (a solution that is the best within a small, local region of the solution space but not necessarily the globally optimal one), so it might not always find the best clustering. Alternative methods such as hierarchical clustering or DBSCAN, which don’t rely on random initialization, can avoid this issue.