Visualize the Working of K-Nearest Neighbors
Learn to visualize the working principle behind k-nearest neighbors.
Let’s move on and put into practice what we have learned so far. As always, we start by importing some basic libraries.
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

sns.set(font_scale=1.5)     # set the font size for the whole notebook
sns.set_style('whitegrid')  # set the plot style -- optional!
```
Let's generate a dataset with two classes and see how the KNN algorithm assigns a class to a new data point.
The dataset
We can use make_biclusters() from scikit-learn to create a simple dataset with two features (columns) and 50 observations (data points). We can also add Gaussian noise while creating the clusters and assign each observation a class. Let's do this.
```python
# Generate 2 random clusters, create dataframe
from sklearn.datasets import make_biclusters  # to generate data

X, classes, cols = make_biclusters(
    shape=(50, 2),     # (n_rows, n_cols) -- observations and features
    n_clusters=2,      # number of classes we want
    noise=50,          # the standard deviation of the Gaussian noise
    random_state=101,  # to re-generate the same data every time
)

# Creating dataframe
df = pd.DataFrame(X, columns=['feature_2', 'feature_1'])
df['target'] = classes[0]

# Well, instead of True/False, let's replace with 1/0 targets -- a practice for map and lambda!
df['target'] = df['target'].map(lambda t: '1' if t == 0 else '0')
print(df.tail(2))  # tail this time!
```
Let's check the class distribution.
```python
print(df.target.value_counts())
```
As seen in the code output above, we have the data with two features and a target column.
Visualize the training and test data
Let's create a scatterplot to visualize the distribution of the data points, using the hue parameter to show the classes in different colors. In a second plot (right side), we add a test point whose class is unknown and which we want KNN to classify.
```python
# Figure 1 (left)
fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(16, 8))
sns.scatterplot(x='feature_1', y='feature_2', data=df, hue='target', ax=ax1, s=150)
ax1.set_title("The data -- two classes")
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.legend().set_title('Target')

# Our new (test) point
test_point = [[10, 50]]

# Figure 2 (right)
sns.scatterplot(x='feature_1', y='feature_2', data=df, hue='target', ax=ax2, s=150)
ax2.scatter(x=test_point[0][0], y=test_point[0][1], color="red", marker="*", s=1000)
ax2.set_title('Red star is a test (unknown) point')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend().set_title('Target')
```
The red star is a new, unknown data point whose class we want our KNN algorithm to predict, and for this purpose, we need to perform the following ...
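Before walking through those steps on our dataset, here is a minimal sketch of the core idea behind a KNN prediction: compute the distance from the test point to every training point, pick the k nearest, and take a majority vote of their labels. The function name and the tiny toy arrays below are hypothetical stand-ins, not our actual dataframe.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, test_point, k=3):
    # Euclidean distance from the test point to every training point
    distances = np.sqrt(((X_train - test_point) ** 2).sum(axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Hypothetical toy data: two well-separated groups
X_train = np.array([[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]])
y_train = np.array(['0', '0', '0', '1', '1', '1'])

print(knn_predict(X_train, y_train, np.array([2, 2]), k=3))  # -> '0'
print(knn_predict(X_train, y_train, np.array([9, 9]), k=3))  # -> '1'
```

In practice we would use scikit-learn's KNeighborsClassifier rather than hand-rolling this, but the sketch shows exactly what happens under the hood when the red star gets its class.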