In this lesson, we’re going to discuss the first clustering approach, partitional clustering, and its three subtypes: density-, centroid-, and distribution-based partitioning. This approach aims to partition the data into disjoint subsets, each of which makes up a cluster. Methods adopting this approach include k-means, fuzzy c-means, DBSCAN, and model-based cluster learning using the expectation-maximization (EM) approach.

K-means

The goal of k-means is to find a set of centroids, each of which represents the center of a cluster and is characterized by the mean vector of the data points belonging to that cluster. Based on the provided distance metric, each data point is then assigned to the centroid closest to it.

Assumptions and target cluster type

The k-means algorithm searches for clusters that are spherical in shape. Note that this algorithm will not perform well when the clusters are elongated or intertwined, such as the spirals and nested circles in the figure below. K-means expects data in the form of continuous, numeric vectors, since it uses the mean vector (computed over all attributes) as the representation of a cluster.
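To make this limitation concrete, here is a minimal sketch (assuming scikit-learn is available; the synthetic dataset is our own illustration, not part of this lesson) that runs k-means on two nested circles and checks how poorly the recovered labels match the true rings:

```python
# Minimal sketch: k-means on nested circles, where the true clusters
# are rings rather than spherical blobs.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.cluster import KMeans

# Generate two nested circles with a little noise.
X, y_true = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=0)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# k-means splits the plane in half instead of separating the rings,
# so agreement with the true ring labels stays near chance level.
agreement = max(np.mean(labels == y_true), np.mean(labels != y_true))
print(f"Agreement with true ring labels: {agreement:.2f}")
```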

Method

The inputs of k-means are:

  • The input dataset, $D$
  • (Optional) The distance function $d(x, y)$, with $x$, $y$ being data points in $D$
  • The number of clusters, $K$

The second input is optional because, by default, k-means implementations use the Euclidean distance. In theory, k-means can work with other metrics as well, for instance, the kernel-induced distance used in kernel k-means.
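As a minimal sketch, the default Euclidean distance $d(x, y)$ can be written in a few lines of NumPy (the function name is our own, not part of any library):

```python
# Euclidean distance between two numeric vectors x and y.
import numpy as np

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # prints 5.0
```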

K-means follows the steps below.

Step 1: Initialization

Select $K$ initial centroids, represented as mean vectors $c_1, c_2, \ldots, c_K$. These centroids can be selected randomly or with prior knowledge.
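A minimal sketch of random initialization, assuming the dataset $D$ is a NumPy array of shape (n, p) (the function name and seed are our own choices):

```python
# Pick K distinct data points from D to serve as the initial centroids.
import numpy as np

rng = np.random.default_rng(seed=0)

def init_centroids(D: np.ndarray, K: int) -> np.ndarray:
    idx = rng.choice(len(D), size=K, replace=False)
    return D[idx].copy()
```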

Step 2: Cluster assignment

Each data point in $D$ is assigned to the centroid closest to it. The distance is computed using the distance function $d$, which is the Euclidean distance by default.
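A minimal sketch of the assignment step under the default Euclidean distance (the function name is our own):

```python
# Assign each point in D to its nearest centroid.
import numpy as np

def assign_clusters(D: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    # dists[i, j] = Euclidean distance from point i to centroid j,
    # computed via broadcasting: (n, 1, p) - (1, K, p) -> (n, K, p).
    dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(dists, axis=1)
```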

Step 3: Centroid update

Cluster centroids are recomputed to be the means of the points currently assigned to them (a code sketch follows the definitions below):

$$c_j = \frac{1}{|D_j|} \sum_{x_i \in D_j} x_i$$

where

  • $c_j$ is the mean vector of cluster $j$,
  • $D_j \subset D$ is the data subset containing all data points that are currently assigned to cluster $j$, and
  • $x_i \in D_j$ are the data points assigned to cluster $j$.
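A minimal sketch of the update step matching the formula above (empty-cluster handling is omitted for brevity):

```python
# Recompute each centroid as the mean of its currently assigned points.
import numpy as np

def update_centroids(D: np.ndarray, labels: np.ndarray, K: int) -> np.ndarray:
    # Note: a cluster that ends up empty would yield NaN here; real
    # implementations reseed such centroids.
    return np.vstack([D[labels == j].mean(axis=0) for j in range(K)])
```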

Step 4: Cluster reassignment

Each data point in $D$ is reassigned to the centroid closest to it, as in Step 2.

Step 5: Stability checking and termination

If there’s no change in the cluster assignment, the algorithm terminates. Otherwise, we go back to Step 3.
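Putting the pieces together, here is a minimal sketch of the full loop, reusing the helper functions sketched in the previous steps (the `max_iter` safety cap is our own addition, not part of the classical description):

```python
# Full k-means loop: iterate update and reassignment until the
# cluster assignment stops changing.
import numpy as np

def kmeans(D: np.ndarray, K: int, max_iter: int = 100):
    centroids = init_centroids(D, K)         # Step 1: initialization
    labels = assign_clusters(D, centroids)   # Step 2: cluster assignment
    for _ in range(max_iter):
        centroids = update_centroids(D, labels, K)  # Step 3: centroid update
        new_labels = assign_clusters(D, centroids)  # Step 4: reassignment
        if np.array_equal(new_labels, labels):      # Step 5: stability check
            break
        labels = new_labels
    return centroids, labels
```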

Hyperparameter tuning

K-means expects the number of clusters $K$ as input, but we often don’t know $K$ a priori. As such, $K$ needs to be tuned, typically by sweeping from a small $K$ up to a large one. Cluster evaluation metrics are used to assess the quality of the resulting clusters, and the elbow method is used to select the best $K$.
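A minimal sketch of this sweep (assuming scikit-learn and matplotlib are available; the function name and the range of $K$ are our own choices): run k-means for each candidate $K$, record the within-cluster sum of squares, and look for the elbow where improvement slows down.

```python
# Elbow method: plot WSS (scikit-learn's inertia_) against K.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_range=range(2, 11)):
    wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
           for k in k_range]
    plt.plot(list(k_range), wss, marker="o")
    plt.xlabel("Number of clusters K")
    plt.ylabel("WSS")
    plt.show()
```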

A practical example of applying k-means

In the lab, we’ll apply k-means to cluster players from the Dotalicious dataset. We’ll first need to load and preprocess the data. Next, since k-means assumes clusters of spherical shape, it’s important to scale or normalize the data, following the process outlined previously, to put the variables on comparable scales. We may also want to check for correlations among the attributes and consolidate redundant features; to do so, follow the steps discussed in the Introduction to Probability and Statistics and Data Abstraction chapters. This matters because correlated attributes introduce bias into the distance computation. In general, dimension-reduction techniques such as PCA are useful for eliminating such correlations before running cluster analysis.
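A minimal, hypothetical sketch of this preprocessing flow (the file name and column handling are placeholders, not the actual Dotalicious schema):

```python
# Scale the features and decorrelate them with PCA before clustering.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

df = pd.read_csv("dotalicious_players.csv")        # hypothetical path
X = df.select_dtypes(include="number").dropna()    # numeric attributes only

X_scaled = StandardScaler().fit_transform(X)       # comparable scales
# Keep enough components to explain 95% of the variance, removing
# correlated directions before the distance computation.
X_pca = PCA(n_components=0.95).fit_transform(X_scaled)
```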

Selecting clusters

Next, as discussed earlier, to select an appropriate number of clusters, we use a scree plot to examine how the within-cluster sum of squares (WSS) changes with the number of clusters (see the figure below). We can clearly observe that the improvement in WSS slows down after $K = 8$, with $K = 4$ and $K = 8$ being the two elbows of the plot. We can also use the silhouette width to see how cohesive the clusters are for the chosen number of clusters.
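A minimal sketch of that check (assuming scikit-learn, and reusing the hypothetical `X_pca` from the preprocessing sketch above):

```python
# Compare average silhouette width for the two candidate elbows.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in (4, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    print(f"K={k}: average silhouette width = "
          f"{silhouette_score(X_pca, labels):.3f}")
```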
