Implementing the DBSCAN algorithm with sklearn

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a versatile clustering algorithm used in data mining and machine learning. It is effective at identifying clusters of arbitrary shape and size, and at flagging sparse points as noise. Unlike some algorithms, such as k-means, DBSCAN does not require the user to specify the number of clusters in advance, which makes it well suited for data with an unknown number of clusters.

Note: To read more about the DBSCAN algorithm, check out this answer.
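As a quick illustration of this property, the minimal sketch below (using sklearn's synthetic two-moons data, not the dataset we work with in the steps that follow; the eps and min_samples values are illustrative) lets DBSCAN discover the number of clusters on its own:

#a minimal sketch on synthetic data: DBSCAN finds the number of clusters itself
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

X_demo, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
demo_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_demo)
#count the clusters found, excluding the noise label (-1) if present
n_clusters = len(set(demo_labels.tolist())) - (1 if -1 in demo_labels else 0)
print("Clusters found:", n_clusters)  #typically 2 with these parameters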

Step-by-step method

To implement DBSCAN on a given dataset, we will follow the three steps outlined below.

Step 1: Visualize the dataset

To start off, we will load the dataset required for the DBSCAN algorithm via the read_csv() function from the pandas library. To keep things simple, we'll use only two columns of the dataset so the data stays two-dimensional and easy to visualize. After this, we'll plot a scatter plot to see how the datapoints are distributed.

#import the necessary libraries for this step
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
#read the csv dataset and store it in the dataset variable
dataset = pd.read_csv('/Employee.csv')
#display the first 5 rows of the dataset (optional)
print(dataset.head(5))
#select the two columns from the dataset to make the data 2D
X = np.array([dataset['Income'], dataset['Expenditure Score(1-100)']])
#take the transpose so each row is one 2D point for the scatter plot
X = np.transpose(X)
#label the axes
plt.xlabel('Income')
plt.ylabel('Expenditure Score')
#plot the scatter plot of the 2D data
plt.scatter(X[:, 0], X[:, 1], c='black', s=20)
plt.show()

In the scatter plot generated above, we can see the two columns we selected from the CSV file named Employee.csv, namely "Income" and "Expenditure Score(1-100)."

Note: We can use more columns if we want, but our data will become multidimensional, making it harder to visualize and process later on.

Step 2: Generate predictions

Next, we'll create a DBSCAN model (from sklearn.cluster) and call its fit_predict() method to generate a cluster label for every datapoint. For this example, we set the neighborhood radius (eps) to 3 and the minimum number of samples per neighborhood (min_samples) to 4, with the Euclidean distance metric.

from sklearn.cluster import DBSCAN

#fit the datapoints and predict a cluster label for each one
dbscan = DBSCAN(eps=3, min_samples=4, metric='euclidean')
labels = dbscan.fit_predict(X)
#display the predicted labels of the datapoints
print('Predicted labels:')
print(labels)
#extract the set of unique labels, keeping the noise label (-1) at the end of the list
listlab = sorted(set(labels.tolist()))
if -1 in listlab:
    listlab.remove(-1)
    listlab.append(-1)
print("Unique labels:")
print(listlab)

Note: Using different values for the eps, min_samples, and metric parameters will result in different clusters being formed. For instance, increasing min_samples means a region must contain more points to qualify as a cluster, so fewer, denser clusters form and more datapoints are labeled as noise; decreasing it has the opposite effect.

Here, we can see that the total number of unique labels generated by the DBSCAN algorithm turns out to be ten, with the outlier (noise) label denoted as -1. This means that there will be a total of nine clusters, excluding the outliers, in our final scatter plot in the next step.
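To get a feel for how much the parameters matter, a rough sweep like the sketch below (assuming the X array from step 1 is still in scope; the parameter values tried here are arbitrary) prints the cluster and noise counts for several combinations:

#a rough parameter sweep, assuming X from step 1; the values tried are arbitrary
for eps in (2, 3, 5):
    for ms in (3, 4, 8):
        lab = DBSCAN(eps=eps, min_samples=ms).fit_predict(X)
        n_clusters = len(set(lab.tolist())) - (1 if -1 in lab else 0)
        n_noise = int(np.sum(lab == -1))
        print(f"eps={eps}, min_samples={ms}: {n_clusters} clusters, {n_noise} noise points")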

Step 3: Plot the results

Lastly, we plot the datapoints according to their new cluster labels, along with each cluster's centroid, on a scatter plot. The code is given below:

#list of different colors to differentiate between the many clusters that the DBSCAN algorithm can form
colors = ["red", "green", "blue", "yellow", "black", "orange", "purple", "brown", "gray", "magenta"]
fig, ax = plt.subplots()
centroids = []
#plot the scatter plot including outliers (label -1), one cluster label per loop iteration
for i in range(len(listlab)):
    #points corresponding to the current cluster label are collected in the points array and plotted
    points = np.array([X[j] for j in range(len(X)) if labels[j] == listlab[i]])
    #the centroid of each cluster is the mean of its points; the noise label (-1) gets no centroid
    if listlab[i] != -1:
        centroids.append(list(np.mean(points, axis=0)))
    ax.scatter(points[:, 0], points[:, 1], s=10, c=colors[i])

#centroids of the newly formed clusters are plotted as asterisks
centroids = np.array(centroids)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='black')
plt.show()

The line-by-line explanation of the code is given below:

  • Line 2: We initialize a list of different colors to differentiate between the clusters that will be formed as a result of this algorithm.

  • Lines 3–4: We initialize the plot for the clusters, as well as the list for storing the centroid coordinates.

  • Lines 6–12: We plot the points of each cluster and calculate the centroid of every newly formed cluster with the np.mean function, appending it to the centroids list (line 11). The noise label (-1) is skipped when computing centroids (line 10).

  • Lines 15–17: We convert the centroids list to an array and plot the clusters with their respective centroids, drawn as asterisks, on the same diagram.

By inspection, we see that the outliers identified by the DBSCAN algorithm appear on the scatter plot as magenta dots (the last color in the colors list), and no centroid asterisk is plotted for them, since we skip the -1 label when computing centroids.
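If we want to inspect these outliers directly instead of just spotting them on the plot, a small check like the one below (reusing X and labels from the earlier steps) extracts them with boolean indexing:

#extract the rows that DBSCAN labeled as noise (-1), reusing labels from step 2
outliers = X[labels == -1]
print("Number of outliers:", len(outliers))
print(outliers)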

Results without outliers

For the sake of comparison, we'll plot the same scatter plot as shown above, but this time, remove the outliers from the scatter plot.

#plot the scatter plot without the outliers (label -1), one cluster label per loop iteration
#list of different colors to differentiate between different clusters
colors = ["red", "green", "blue", "yellow", "black", "orange", "purple", "brown", "gray", "magenta"]
fig, ax = plt.subplots()
centroids = []
#the loop stops one index early to omit the noise label (-1), which sits at the end of listlab
for i in range(len(listlab) - 1):
    #points corresponding to the current cluster label are collected in the points array and plotted
    points = np.array([X[j] for j in range(len(X)) if labels[j] == listlab[i]])
    #the centroid of these points is the mean of the points array in each iteration
    centroids.append(list(np.mean(points, axis=0)))
    ax.scatter(points[:, 0], points[:, 1], s=10, c=colors[i])

#centroids of the newly formed clusters are plotted as asterisks
centroids = np.array(centroids)
ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='black')
plt.show()

The line-by-line explanation of the code is given below:

  • Line 3: We initialize a list of different colors to differentiate between the clusters that will be formed as a result of this algorithm.

  • Lines 4–5: We initialize the plot for the clusters, as well as the list for storing the centroid coordinates.

  • Lines 7–12: We plot the points of the various clusters but omit the points with the -1 label, which denotes outliers. Since we kept the -1 label at the last index of listlab in step 2, stopping the for loop (line 7) one index before the end of the list leaves these outlier points out of the plot.

  • Lines 15–17: We convert the centroids list to an array and plot the clusters with their respective centroids, drawn as asterisks, on the same diagram.

Conclusion

DBSCAN’s robustness to noise, efficiency with appropriate indexing, and capacity to identify outliers as noise points contribute to its popularity in various data mining, machine learning, and image segmentation tasks. Although it requires careful tuning of its hyperparameters (epsilon, min_samples, and metric), DBSCAN remains a reliable tool for uncovering hidden patterns and structures within complex datasets, making it a valuable addition to the toolkit of data scientists and analysts.
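As a starting point for tuning eps, one common heuristic is the k-distance graph: plot each point's distance to its k-th nearest neighbor (with k set to min_samples) in sorted order and read a candidate eps off the curve's elbow. The sketch below assumes the X array and the imports from step 1 are available:

#a k-distance sketch for choosing eps, assuming X and the imports from step 1
from sklearn.neighbors import NearestNeighbors

k = 4  #match min_samples
#k + 1 neighbors because each point counts as its own nearest neighbor here
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
#sort each point's distance to its kth real neighbor and look for the elbow
plt.plot(np.sort(distances[:, -1]))
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to 4th nearest neighbor')
plt.show()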
