Implementing the DBSCAN algorithm with sklearn

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a versatile clustering algorithm used in data mining and machine learning. It is effective at identifying clusters of arbitrary shape and size, and at flagging sparse points as noise. Unlike some algorithms, such as k-means, DBSCAN does not require the user to specify the number of clusters in advance, which makes it well suited for data with an unknown number of clusters.

Note: To read more about the DBSCAN algorithm, check out this answer.
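As a quick illustration of this property, the minimal sketch below (using sklearn's synthetic two-moons data, not the dataset we work with in the steps that follow; the eps and min_samples values are illustrative) lets DBSCAN discover the number of clusters on its own:

#a minimal sketch on synthetic data: DBSCAN finds the number of clusters itself
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X_demo, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
demo_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)
#count the clusters found, excluding the noise label (-1) if present
n_clusters = len(set(demo_labels.tolist())) - (1 if -1 in demo_labels else 0)
print("Clusters found:", n_clusters)  #typically 2 with these parameters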

Step-by-step method

To implement DBSCAN on a given dataset, we will follow the three steps outlined below.

Step 1: Visualize the dataset

To start off, we will load the dataset required for the DBSCAN algorithm via the read_csv() function from the pandas library. To keep things simple, we'll use only two columns of the dataset so the data stays two-dimensional and easy to visualize. After this, we'll plot a scatter plot to see how the datapoints are distributed.

#import the necessary libraries for this step
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#read the csv dataset and store it in the dataset variable
dataset = pd.read_csv('/Employee.csv')
#display the first 5 rows of the dataset (optional)
print(dataset.head(5))
#select the two columns from the dataset to make the data 2D
X = np.array([dataset['Income'], dataset['Expenditure Score(1-100)']])
#take the transpose so each row is one 2D point for the scatter plot
X = np.transpose(X)
#label the axes
plt.xlabel('Income')
plt.ylabel('Expenditure Score')
#plot the scatter plot of the 2D data
plt.scatter(X[:, 0], X[:, 1], c='black', s=20)
plt.show()

In the scatter plot generated above, we can see the two columns we selected from the CSV file named Employee.csv, namely "Income" and "Expenditure Score(1-100)."

Note: We can use more columns if we want, but our data will become multidimensional, making it harder to visualize and process later on.

Step 2: Generate predictions

Next, we'll create a DBSCAN model (from sklearn.cluster) and call its fit_predict() method to generate a cluster label for every datapoint. For this example, we set the neighborhood radius (eps) to 3 and the minimum number of samples per neighborhood (min_samples) to 4, with the Euclidean distance metric.

from sklearn.cluster import DBSCAN

#fit the datapoints and predict a cluster label for each one
dbscan = DBSCAN(eps=3, min_samples=4, metric='euclidean')
labels = dbscan.fit_predict(X)
#display the predicted labels of the datapoints
print('Predicted labels:')
print(labels)
#extract the set of unique labels, keeping the noise label (-1) at the end of the list
listlab = sorted(set(labels.tolist()))
if -1 in listlab:
    listlab.remove(-1)
    listlab.append(-1)
print("Unique labels:")
print(listlab)

Note: Using different values for the eps, min_samples, and metric parameters will result in different clusters being formed. For instance, increasing min_samples means a region must contain more points to qualify as a cluster, so fewer, denser clusters form and more datapoints are labeled as noise; decreasing it has the opposite effect.

Here, we can see that the total number of unique labels generated by the DBSCAN algorithm turns out to be ten, with the outlier (noise) label denoted as -1. This means that there will be a total of nine clusters, excluding the outliers, in our final scatter plot in the next step.
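To get a feel for how much the parameters matter, a rough sweep like the sketch below (assuming the X array from step 1 is still in scope; the parameter values tried here are arbitrary) prints the cluster and noise counts for several combinations:

#a rough parameter sweep, assuming X from step 1; the values tried are arbitrary
for eps in (2, 3, 5):
    for ms in (3, 4, 8):
        lab = DBSCAN(eps=eps, min_samples=ms).fit_predict(X)
        n_clusters = len(set(lab.tolist())) - (1 if -1 in lab else 0)
        n_noise = int(np.sum(lab == -1))
        print(f"eps={eps}, min_samples={ms}: {n_clusters} clusters, {n_noise} noise points")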

Step 3: Plot the results

Lastly, we plot the datapoints according to their new cluster labels, along with each cluster's centroid, on a scatter plot. The code is given below:

#list of different colors to differentiate between the many clusters that the DBSCAN algorithm can form
colors = ["red", "green", "blue", "yellow", "black", "orange", "purple", "brown", "gray", "magenta"]
fig, ax = plt.subplots()
centroids = []
#plot the scatter plot including outliers (label -1), one cluster label per loop iteration
for i in range(len(listlab)):
    #points corresponding to the current cluster label are collected in the points array and plotted
    points = np.array([X[j] for j in range(len(X)) if labels[j] == listlab[i]])
    #the centroid of each cluster is the mean of its points; the noise label (-1) gets no centroid
    if listlab[i] != -1:
        centroids.append(list(np.mean(points, axis=0)))
    ax.scatter(points[:, 0], points[:, 1], s=10, c=colors[i])

#centroids of the newly formed clusters are plotted as asterisks
centroids = np.array(centroids)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='black')
plt.show()

The line-by-line explanation of the code is given below:

  • Line 2: We initialize a list of different colors to differentiate between the clusters that will be formed as a result of this algorithm.

  • Lines 3–4: We initialize the plot for the clusters, as well as the list for storing the centroid coordinates.

  • Lines 6–12: We plot the points of each cluster and calculate the centroid of every newly formed cluster with the np.mean function, appending it to the centroids list (line 11). The noise label (-1) is skipped when computing centroids (line 10).

  • Lines 15–17: We convert the centroids list to an array and plot the clusters with their respective centroids, drawn as asterisks, on the same diagram.

By inspection, we see that the outliers identified by the DBSCAN algorithm appear on the scatter plot as magenta dots (the last color in the colors list), and no centroid asterisk is plotted for them, since we skip the -1 label when computing centroids.
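If we want to inspect these outliers directly instead of just spotting them on the plot, a small check like the one below (reusing X and labels from the earlier steps) extracts them with boolean indexing:

#extract the rows that DBSCAN labeled as noise (-1), reusing labels from step 2
outliers = X[labels == -1]
print("Number of outliers:", len(outliers))
print(outliers)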

Results without outliers

For the sake of comparison, we'll plot the same scatter plot as shown above, but this time, remove the outliers from the scatter plot.

#plot the scatter plot without the outliers (label -1), one cluster label per loop iteration
#list of different colors to differentiate between different clusters
colors = ["red", "green", "blue", "yellow", "black", "orange", "purple", "brown", "gray", "magenta"]
fig, ax = plt.subplots()
centroids = []
#the loop stops one index early to omit the noise label (-1), which sits at the end of listlab
for i in range(len(listlab) - 1):
    #points corresponding to the current cluster label are collected in the points array and plotted
    points = np.array([X[j] for j in range(len(X)) if labels[j] == listlab[i]])
    #the centroid of these points is the mean of the points array in each iteration
    centroids.append(list(np.mean(points, axis=0)))
    ax.scatter(points[:, 0], points[:, 1], s=10, c=colors[i])

#centroids of the newly formed clusters are plotted as asterisks
centroids = np.array(centroids)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='black')
plt.show()

The line-by-line explanation of the code is given below:

  • Line 3: We initialize a list of different colors to differentiate between the clusters that will be formed as a result of this algorithm.

  • Lines 4–5: We initialize the plot for the clusters, as well as the list for storing the centroid coordinates.

  • Lines 7–12: We plot the points of the various clusters but omit the points with the -1 label, which denotes outliers. Since we kept the -1 label at the last index of listlab in step 2, stopping the for loop (line 7) one index before the end of the list leaves these outlier points out of the plot.

  • Lines 15–17: We convert the centroids list to an array and plot the clusters with their respective centroids, drawn as asterisks, on the same diagram.

Conclusion

DBSCAN’s robustness to noise, efficiency with appropriate indexing, and capacity to identify outliers as noise points contribute to its popularity in various data mining, machine learning, and image segmentation tasks. Although it requires careful tuning of its hyperparameters (epsilon, min_samples, and metric), DBSCAN remains a reliable tool for uncovering hidden patterns and structures within complex datasets, making it a valuable addition to the toolkit of data scientists and analysts.
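As a starting point for tuning eps, one common heuristic is the k-distance graph: plot each point's distance to its k-th nearest neighbor (with k set to min_samples) in sorted order and read a candidate eps off the curve's elbow. The sketch below assumes the X array and the imports from step 1 are available:

#a k-distance sketch for choosing eps, assuming X and the imports from step 1
from sklearn.neighbors import NearestNeighbors

k = 4  #match min_samples
#k + 1 neighbors because each point counts as its own nearest neighbor here
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
#sort each point's distance to its kth real neighbor and look for the elbow
plt.plot(np.sort(distances[:, -1]))
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to 4th nearest neighbor')
plt.show()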
