Clustering is a data modeling technique that groups objects according to the features they have in common within a data set. We can then analyze these clusters to extract meaningful information from them.
Hierarchical clustering is one of several families of clustering algorithms.
There are two types of hierarchical clustering algorithms:
Agglomerative Nesting (AGNES) is a convergent (bottom-up) approach. It starts by assigning each data point to a cluster of its own. The dissimilarity between every pair of clusters is then calculated, and the two clusters with the least dissimilarity are merged. This step repeats until all the nodes end up in a single cluster, as shown in the figure above.
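The merge loop described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names `single_linkage` and `agnes` are hypothetical, and using the smallest pairwise Euclidean distance as the dissimilarity is an assumption made here for simplicity.

```python
import math


def single_linkage(a, b):
    """Smallest pairwise Euclidean distance between two clusters (an assumed choice)."""
    return min(math.dist(p, q) for p in a for q in b)


def agnes(points, n_clusters=1):
    """Merge the two least dissimilar clusters until n_clusters remain."""
    # Start with every point in a cluster of its own.
    clusters = [[p] for p in points]
    merges = []  # history of (cluster_a, cluster_b, dissimilarity)
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the least dissimilarity.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        # Replace the pair with their union.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters, merges


# The six sample points used in the scikit-learn example in this article.
points = [(1, 4), (1, 5), (1, 8), (6, 3), (9, 2), (1, 6)]
clusters, merges = agnes(points, n_clusters=2)
print(clusters)
```

Stopping at two clusters groups the four points with x = 1 together and leaves (6, 3) and (9, 2) in the other cluster. Calling `agnes(points)` with the default of one cluster runs the merges all the way to a single cluster.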
Reversing the order of agglomerative hierarchical analysis gives Divisive Analysis (DIANA), which is a divergent (top-down) approach. The algorithm starts with a single cluster that contains every node and splits it repeatedly until each node ends up in a cluster of its own.
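A divisive pass can be sketched in the same spirit. The version below assumes one common formulation of DIANA: the widest cluster is split by seeding a "splinter" group with the point that is farthest, on average, from the rest, and then moving over any point that sits closer to the splinter group than to the remainder. The helper names (`avg_dist`, `split_cluster`, `diameter`, `diana`) are hypothetical, and Euclidean distance is an assumption.

```python
import math


def avg_dist(p, group):
    """Average Euclidean distance from point p to the points in group."""
    return sum(math.dist(p, q) for q in group) / len(group)


def split_cluster(cluster):
    """Split one cluster (two or more points) into a splinter group and a remainder."""
    if len(cluster) < 2:
        raise ValueError("need at least two points to split")
    rest = list(cluster)
    # Seed the splinter group with the point farthest, on average, from the others.
    seed = max(rest, key=lambda p: avg_dist(p, [q for q in rest if q is not p]))
    rest.remove(seed)
    splinter = [seed]
    while len(rest) > 1:
        # Gain = average distance to the remainder minus average distance to the splinter group.
        def gain(p):
            others = [q for q in rest if q is not p]
            return avg_dist(p, others) - avg_dist(p, splinter)

        candidate = max(rest, key=gain)
        if gain(candidate) <= 0:
            break  # nobody left is closer to the splinter group
        rest.remove(candidate)
        splinter.append(candidate)
    return splinter, rest


def diameter(cluster):
    """Largest pairwise distance inside a cluster."""
    return max(math.dist(p, q) for p in cluster for q in cluster)


def diana(points, n_clusters):
    """Start from one cluster and keep splitting the widest one."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        widest = max(splittable, key=diameter)
        clusters.remove(widest)
        clusters.extend(split_cluster(widest))
    return clusters


points = [(1, 4), (1, 5), (1, 8), (6, 3), (9, 2), (1, 6)]
two_clusters = diana(points, n_clusters=2)
print(two_clusters)
```

Asking for as many clusters as there are points, for example `diana(points, n_clusters=6)`, continues the splitting until every point is in a cluster of its own.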
The following measures can be used to compute dissimilarity between the clusters to merge them:
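The list of measures is not reproduced here. As an assumption about what it covered, three widely used linkage criteria are single linkage (smallest pairwise distance), complete linkage (largest pairwise distance), and average linkage (mean pairwise distance). The sketch below computes all three between two small clusters; the function names are hypothetical and Euclidean distance is assumed.

```python
import math
from itertools import product


def pairwise(a, b):
    """All Euclidean distances between a point in cluster a and a point in cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]


def single_linkage(a, b):
    """Distance between the closest pair of points."""
    return min(pairwise(a, b))


def complete_linkage(a, b):
    """Distance between the farthest pair of points."""
    return max(pairwise(a, b))


def average_linkage(a, b):
    """Mean distance over all pairs of points."""
    distances = pairwise(a, b)
    return sum(distances) / len(distances)


left = [(1, 4), (1, 5), (1, 6)]
right = [(6, 3), (9, 2)]

single = single_linkage(left, right)
complete = complete_linkage(left, right)
average = average_linkage(left, right)
print(single, average, complete)
```

The average always lies between the single and complete values, which is why the choice of measure can change which pair of clusters is merged next.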
# importing libraries
from sklearn.cluster import AgglomerativeClustering
import numpy as np

# initializing sample data
X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
print('Dataset: ')
print(X)

# loading the clustering algorithm
# (the distance argument is named `metric` from scikit-learn 1.2 onward; earlier versions call it `affinity`)
model = AgglomerativeClustering(n_clusters=2, metric='euclidean')

# fitting the data
model = model.fit(X)

# printing the assigned cluster labels
print('\nThe labels assigned to the train data are: ')
print(model.labels_)

# clustering the test data
# (AgglomerativeClustering has no separate predict method, so fit_predict refits the model on Y)
Y = np.array([[1, 2], [2, 3], [5, 5], [6, 0]])
print('\nTest data: ')
print(Y)
print('\nThe labels assigned to test data are: ')
print(model.fit_predict(Y))
In the code above, we cluster the NumPy array X, a 2D array of shape 6x2 that represents our data.
We use scikit-learn, a machine learning library for Python, to do the clustering.
First, we import the required Python modules and libraries.
Next, we load the clustering model and specify its arguments:
n_clusters: The number of clusters to find.
affinity: The distance metric used to compute the dissimilarity between clusters, for example 'euclidean'. From scikit-learn 1.2 onward this argument is named metric, and the affinity name was removed in version 1.4.
After the model has been set up, it is fitted to the train data, which assigns a cluster label to each item in the train data set.
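To see the effect of these arguments, the sketch below fits the same six points with two different values of n_clusters. It is an illustration only: the choice of linkage="average" is an assumption made here, and the distance argument is left at its default (Euclidean) so the snippet does not depend on which name your scikit-learn version uses for it.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])

results = {}
for k in (2, 3):
    # n_clusters controls where the merging stops.
    model = AgglomerativeClustering(n_clusters=k, linkage="average")
    results[k] = model.fit_predict(X)
    print(f"n_clusters={k}: {results[k]}")
```

The label numbers themselves are arbitrary identifiers; what matters is which points share a label.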
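After fitting the train data, the model is run on the test data set using the fit_predict function. AgglomerativeClustering does not provide a separate predict function, so this call clusters the test data from scratch rather than reusing the clusters found in the train data. Each point in the test data is assigned to a cluster, and we can print the labels to see which cluster each point was assigned to.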
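If the goal is to place new points into the clusters already found in the train data, one possible workaround is to train a simple classifier on the cluster labels. The sketch below uses a 1-nearest-neighbour classifier for this. That is a technique swapped in here for illustration, not something AgglomerativeClustering does itself, and the variable names are hypothetical.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[1, 4], [1, 5], [1, 8], [6, 3], [9, 2], [1, 6]])
Y = np.array([[1, 2], [2, 3], [5, 5], [6, 0]])

# Cluster the train data once.
cluster_labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)

# Give each new point the label of its nearest train point.
classifier = KNeighborsClassifier(n_neighbors=1).fit(X, cluster_labels)
new_labels = classifier.predict(Y)

print("Train labels:", cluster_labels)
print("Test labels: ", new_labels)
```

With this approach the test labels refer to the same clusters as the train labels, so the two sets of labels can be compared directly.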