DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a versatile clustering algorithm used in data mining and machine learning. It is known to be effective at identifying clusters of data points in a dataset that have varying shapes, sizes, and densities. Unlike some algorithms, such as the k-means algorithm, DBSCAN does not require the user to specify the number of clusters in advance. This makes it well-suited for data with an unknown number of clusters.
Note: To read more about the DBSCAN algorithm, check out this answer.
To implement DBSCAN on a given dataset, we will follow a couple of steps that are outlined below.
To start off, we will load the dataset required to use the DBSCAN algorithm. This is done via the pandas
library, which has the read_csv()
function. To keep things simple, we’ll construct a two-dimensional dataset for easier visualization afterward. After this, we’ll plot a scatter plot of the loaded dataset to visualize the distribution of the datapoints on the scatter plot.
#import the necessary libaries for this stepimport pandas as pdimport matplotlib.pyplot as pltimport numpy as np#read the csv dataset to store in the dataset variabledataset = pd.read_csv('/Employee.csv')#displaying the first 5 rows of the dataset (optional)print(dataset.head(5))#select the two columns from the dataset to make the data 2DX = np.array([dataset['Income'],dataset['Expenditure Score(1-100)']])#taking transpose to correct the dimension of the 2d points for plotting the scatter plotX=np.transpose(X)#label the axesplt.xlabel('Income')plt.ylabel('Expenditure Score')#plotting the scatter plot of the 2D dataplt.scatter(X[:,0], X[:,1], c='black', s=20)
In the scatter plot generated above, we can see that we have taken a dataset in the form of a CSV file named as Employee.csv
from which two columns are used, namely “Income” and “Expenditure Score (1-100).”
Note: We can use more columns if we want, but our data will become multidimensional making it harder to visualize and process later on.
Next, we’ll use the DBSCAN.fit()
method that is part of the DBSCAN
library (from sklearn
) to generate the labels for every datapoint to create new clusters. For this example, we have taken the value of the epsilon (eps
) parameter to be three, the value of minimum samples per cluster (min_sample
) to be four with an Euclidean distance metric.
from sklearn.cluster import DBSCAN#fit the datapoints onto the DBSCAN algorithmdbscan = DBSCAN(eps=3, min_samples=4, metric='euclidean').fit(X)#extracting the new labels of the datapointslabels=dbscan.fit_predict(X)ans = dbscan.labels_#display the predicted labels of the datapointsprint('Predicted labels:')print(ans)#extract the set of possible labels from the predicted label listlistlab=list(set(ans))print("Unique labels:")print(listlab)
Note: Using different values of the parameter
epsilon
,min_samples
, andmetric
will result in different clusters being formed. For instance, by increasing the value of themin_samples
parameter, every new cluster would have more points inside it. Conversely, there will be less datapoints inside each cluster.
Here, we can see that the total number of unique labels generated by the DBSCAN algorithm turns out to be ten, with the outlier label denoted as -1
. This means that there will be a total of nine clusters, excluding the outliers in our final scatter plot in the coming step.
Lastly, we plot the datapoints with respect to their new cluster labels and centroids via a scatter plot. The code is given below:
#list of different colors to differentiate between different clusters as there might be many clusters form as a result of the DBSCAN algorithmcolors = ["red","green","blue","yellow","black","orange","purple","brown","gray","magenta"]fig, ax = plt.subplots()centroids=[]#plot scatter plot with outliers (with -1 label), with respect to each label one by one in a loopfor i in range(len(listlab)):#points corresponding to a cluster label are placed in the points array and plotted in each iterationpoints = np.array([X[j] for j in range(len(X)) if labels[j] == listlab[i]])#centroids for these points are generated by taking the mean of the points array in each iterationcentroids.append(list(np.mean(points, axis=0)))ax.scatter(points[:,0], points[:,1], s=10, c=colors[i])#centroids of the newly formed clusters are plotted as asteriskscentroids=np.array(centroids)ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='black')plt.show()
The line-by-line explanation of the code is given below:
Line 2: We initialize a list of different colors to differentiate between different clusters that will be formed as a result of this algorithm
Lines 3–4: We initialize the plot diagram for the clusters, as well as the list for storing the centroid coordinates.
Lines 6–11: We plot the points of the various clusters, as well as calculate the centroids of every newly formed cluster with the np.mean
function and appended to the centroids
list (as shown in line 10).
Lines 14–16: We plot the newly formed clusters with their respective centroids on the same diagram.
By inspection, we see that the outliers generated as a result of the DBSCAN algorithm are displayed on the scatter plot in the form of pink dots, as these points do not have a centroid plotted around them.
For the sake of comparison, we'll plot the same scatter plot as shown above, but this time, remove the outliers from the scatter plot.
#plot scatter plot without outliers (with -1 label), with respect to each label one by one in a loop#list of different colors to differentiate between different clusterscolors = ["red","green","blue","yellow","black","orange","purple","brown","gray","magenta"]fig, ax = plt.subplots()centroids=[]#Note that the length of the listlab list is subtracted by 1 to omit the outlier points being added in the points array for visualizationfor i in range(len(listlab)-1):#points corresponding to a cluster label are placed in the points array and plotted in each iterationpoints = np.array([X[j] for j in range(len(X)) if labels[j] == i])#centroids for these points are generated by taking the mean of the points array in each iterationcentroids.append(list(np.mean(points, axis=0)))ax.scatter(points[:,0], points[:,1], s=10, c=colors[i])#centroids of the newly formed clusters are plotted as asteriskscentroids=np.array(centroids)ax.scatter(centroids[:, 0], centroids[:, 1], marker='*', s=200, c='black')plt.show()
The line-by-line explanation of the code is given below:
Line 3: We initialize a list of different colors to differentiate between different clusters that will be formed as a result of this algorithm
Line 4–5: We initialize the plot diagram for the clusters, as well as the list for storing the centroid coordinates.
Lines 7–12: We plot the points of the various clusters but omit the points with the -1
label, which denotes outliers. If we go back to the output of the listlab
list, we see that the -1
label is at the last index of the list. Therefore, we'll not iterate on the last index of listlab
in the for loop (in line 9), removing these outlier points from the plot.
Lines 15–17: We plot the newly formed clusters with their respective centroids on the same diagram.
DBSCAN’s robustness to noise, efficiency with appropriate indexing, and capacity to identify outliers as noise points contribute to its popularity in various data mining, machine learning, and image segmentation tasks. Although it requires careful tuning of its hyperparameters (epsilon
, min_samples
, and metric
), DBSCAN remains a reliable tool for uncovering hidden patterns and structures within complex datasets, making it a valuable addition to the toolkit of data scientists and analysts.