
What is k-NN?

7 min read
Nov 16, 2023
Contents
Overview
Introduction to k-NN
How k-NN works
Example
k-NN algorithm in Python
Advantages and disadvantages of using the k-NN classification algorithm
Conclusion and next steps


Overview

You've probably heard all the buzz about machine learning and its applications. And if you have, you've probably heard about k-nearest neighbors (k-NN). This algorithm is one of the simplest and easiest-to-understand classification and regression algorithms, and it can be used for many practical applications. Some of these applications include:

  • Classification problems: The k-NN algorithm can be used in pattern recognition and other classification problems, such as identifying spam emails and classifying documents.

  • Recommender systems: The k-NN algorithm can be applied to find similar users or items and make recommendations.

  • Regression analysis: The k-NN algorithm can also be used for regression tasks, such as predicting housing prices based on features such as area, location, and number of bedrooms.

  • Healthcare and medicine: The k-NN algorithm can assist in identifying the likelihood of certain diseases based on patient data and historical records.

In this blog, our focus will be on classification problems only. We'll walk through a numerical example of the k-NN algorithm to see how it works, and we'll also run the example via Python code. Finally, we'll look at some advantages and disadvantages of using the k-NN algorithm for classification purposes.

Now, let's explore the basics of the k-NN algorithm.

Introduction to k-NN

What is k-NN? As mentioned above, k-NN is a widely recognized classification technique used to assign items to particular categories based on how similar they are to nearby data points. It falls under the category of instance-based or lazy learning algorithms. Unlike algorithms that build explicit models during training, k-NN makes predictions by finding the data points in the training dataset that are most similar to the item being classified.

k-NN model visualization

How k-NN works

Let's take a look at the k-NN algorithm step by step:

Step 1: To classify a test instance d, define the k-neighborhood P as the k nearest neighbors of d.

How to choose the value of k:
The value of k is usually an odd number to avoid a 50% split in a binary classification problem. A very low value of k is more error-prone because the prediction becomes sensitive to noise in the data. In practice, the value of k is usually set to around sqrt(n), where n is the number of instances in the dataset, as the sketch below illustrates.
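
Here is a minimal sketch of this heuristic in Python (the helper name choose_k is ours, purely for illustration):

import math

def choose_k(n):
    # Start from the square root of the dataset size
    k = round(math.sqrt(n))
    # Nudge k to an odd number to avoid a 50% split in binary classification
    if k % 2 == 0:
        k += 1
    return k

print(choose_k(10))  # prints: 3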

Step 2: Calculate the similarity (often using Euclidean distance or another distance metric) between the new item and all data points in the training dataset. Based on the similarity values, extract the k closest instances to form P.

k-NN in steps

Step 3: Assuming multiple classes (c_1, c_2, c_3, and so on), count the number n_i of training instances in P that belong to each class c_i. Optionally, sort the distances in ascending order first; this makes the k closest instances easy to pick out.

Step 4: Assign d the most frequent class among the k neighbors (the majority class). A minimal sketch of this voting step appears below.
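
To make the voting step concrete, here is a minimal sketch using Python's collections.Counter; the neighborhood P is hard-coded with hypothetical labels for three nearest neighbors (a full, from-scratch implementation appears later in this blog):

from collections import Counter

# Hypothetical neighborhood P after steps 1-3: the class labels
# of the k = 3 nearest neighbors of the test instance d
P = ['B', 'B', 'A']

# Step 4: assign d the majority class among its neighbors
majority_class = Counter(P).most_common(1)[0][0]
print(majority_class)  # prints: B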

Example

Assume that we have the following dataset:

Data point | Class
-----------|------
(2, 3)     | A
(3, 4)     | A
(5, 6)     | B
(7, 8)     | B
(1, 2)     | A
(6, 7)     | B
(4, 5)     | A
(8, 9)     | B
(2, 2)     | A
(9, 9)     | B

We also have the following test data point: (6, 5)

Step 1: To classify the test instance (6, 5), define the k-neighborhood P as the k nearest neighbors of the test point.

As mentioned above, the value of k is usually set to around sqrt(n), where n is the number of training instances. In our example dataset, n is 10, and sqrt(10) ≈ 3.16, which rounds to the odd value k = 3.

Step 2: We calculate the similarity between the new instance and all data points in the training dataset.

Note: We’ll be using the Euclidean distance formula in our calculations.

Recall the formula for Euclidean distance:

$$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

Here, x_i and y_i are the ith features of the data instances x and y, and n is the number of features in each instance.
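
For instance, plugging the training point (2, 3) and the test point (6, 5) into this formula gives:

$$d((2, 3), (6, 5)) = \sqrt{(2 - 6)^2 + (3 - 5)^2} = \sqrt{16 + 4} = \sqrt{20} \approx 4.47$$

This matches the first entry in the distance table below.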

After calculating distances, here is a table of distances from each point to our test point, i.e., (6, 5):

Data point | Distance from (6, 5)
-----------|---------------------
(2, 3)     | 4.47
(3, 4)     | 3.16
(5, 6)     | 1.41
(7, 8)     | 3.16
(1, 2)     | 5.83
(6, 7)     | 2.00
(4, 5)     | 2.00
(8, 9)     | 4.47
(2, 2)     | 5.00
(9, 9)     | 5.00
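
If you'd like to verify these numbers yourself, here is a small sketch that recomputes the table using only Python's math module:

import math

points = [(2, 3), (3, 4), (5, 6), (7, 8), (1, 2), (6, 7), (4, 5), (8, 9), (2, 2), (9, 9)]
test = (6, 5)

# Print each training point with its rounded distance to the test point
for p in points:
    d = math.sqrt((p[0] - test[0])**2 + (p[1] - test[1])**2)
    print(p, round(d, 2))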

Step 3: Count the number n_i of training instances in P that belong to class c_i.

Data point | Class | Distance from (6, 5) | Rank
-----------|-------|----------------------|-----
(2, 3)     | A     | 4.47                 | —
(3, 4)     | A     | 3.16                 | —
(5, 6)     | B     | 1.41                 | 1
(7, 8)     | B     | 3.16                 | —
(1, 2)     | A     | 5.83                 | —
(6, 7)     | B     | 2.00                 | 2
(4, 5)     | A     | 2.00                 | 3
(8, 9)     | B     | 4.47                 | —
(2, 2)     | A     | 5.00                 | —
(9, 9)     | B     | 5.00                 | —

Step 4: Assign the test instance (6, 5) the class that’s the most frequent or the majority class.

From the above table, we can see that in the neighborhood P of the point (6, 5), two of the three closest instances belong to class B, whereas one belongs to class A. Therefore, the test instance (6, 5) is assigned class B.

k-NN algorithm in Python

The following code implements the k-NN algorithm in Python. Note that we won't be using any libraries (except for the math library) in this implementation example.

import math

# Sample dataset
data = [(2, 3, 'A'), (3, 4, 'A'), (5, 6, 'B'), (7, 8, 'B'), (1, 2, 'A'), (6, 7, 'B'), (4, 5, 'A'), (8, 9, 'B'), (2, 2, 'A'), (9, 9, 'B')]

# Function to calculate Euclidean distance between two points
def euclidean_distance(point1, point2):
    distance = 0
    distance = math.sqrt((point1[0] - point2[0])**2 +
                         (point1[1] - point2[1])**2)
    return distance

# k-NN algorithm
def k_nearest_neighbors(data, query_point, k):
    distances = []

    # Calculate distances from the query point to all data points
    for data_point in data:
        distance = euclidean_distance(query_point, data_point)
        distances.append((data_point, distance))

    # Sort distances in ascending order
    distances.sort(key=lambda x: x[1])

    # Get the k-nearest neighbors
    neighbors = [item[0] for item in distances[:k]]

    # Count the occurrences of each class among the neighbors
    class_counts = {}
    for neighbor in neighbors:
        label = neighbor[2]
        if label in class_counts:
            class_counts[label] += 1
        else:
            class_counts[label] = 1

    # Determine the majority class
    sorted_class_counts = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_class_counts[0][0]

# Test the k-NN algorithm
query = (6, 5)
k = 3
result = k_nearest_neighbors(data, query, k)
print(f"The query point {query} belongs to class: {result}")

The code is explained below:

Line 1: We start by importing the math library, which is later used to calculate square roots while computing Euclidean distances between two points.

Lines 7–11: We define the euclidean_distance function between two points.

Line 14: We start defining the k_nearest_neighbors function, which takes three arguments: data (the dataset), query_point (the point for which we want to find the k nearest neighbors), and k (the number of neighbors).

Line 15: We initialize an empty list called distances to store distances between query_point and the data points.

Lines 18–20: We initialize a loop to iterate through each data point in the dataset. We then call the euclidean_distance function that calculates the distance between query_point and the current data_point and store it in the distance variable. Finally, we append a tuple containing the data point and its distance to the distances list.

Line 23: We then sort the distances list in ascending order of distance. This step identifies the k nearest neighbors.

Line 26: Next, we select the k nearest neighbors from the sorted distances list and store them in the neighbors list.

Lines 29–35: We start by initializing an empty dictionary called class_counts to count the occurrences of each class among the neighbors. We then loop through each neighbor in the neighbors list and store its class label (A or B) in a variable called label. If the class label already exists in the dictionary, we increment its count by 1; otherwise, we add it to the dictionary with a count of 1.

Lines 38–39: We now need to determine the majority class. For this, we sort the class_counts dictionary items in descending order of count and return the class label with the highest count (the majority class).

Lines 42–45: Now, we need to test our k-NN algorithm. For that, we set query to (6, 5) and k to 3. We then call the k_nearest_neighbors function and print the result.
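
As a sanity check, the same prediction can be reproduced with scikit-learn's KNeighborsClassifier, which uses Euclidean distance by default (this sketch assumes scikit-learn is installed; the from-scratch implementation above does not depend on it):

from sklearn.neighbors import KNeighborsClassifier

# The same dataset, split into coordinates (X) and class labels (y)
X = [(2, 3), (3, 4), (5, 6), (7, 8), (1, 2), (6, 7), (4, 5), (8, 9), (2, 2), (9, 9)]
y = ['A', 'A', 'B', 'B', 'A', 'B', 'A', 'B', 'A', 'B']

# k-NN with k = 3 neighbors
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X, y)

print(model.predict([(6, 5)]))  # prints: ['B']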

Advantages and disadvantages of using the k-NN classification algorithm

Let's now take a look at a few advantages and disadvantages of using k-NN for classification tasks.

Advantages

  • The algorithm is simple to understand and implement.

  • Unlike many other classification algorithms, k-NN has no explicit training phase; learning is instance-based.

  • Because computation is deferred until prediction time (this is what makes it a lazy learner), the algorithm adapts easily to new data.

  • No assumptions are made about the data distribution.

  • The models generated by k-NN are interpretable, and we can easily visualize decision boundaries (see the sketch after this list).
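
As an illustration of that last point, here is a minimal sketch that plots the decision boundary for our sample dataset, assuming numpy, matplotlib, and scikit-learn are available (none of these are required by the from-scratch code above):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier

X = np.array([(2, 3), (3, 4), (5, 6), (7, 8), (1, 2), (6, 7), (4, 5), (8, 9), (2, 2), (9, 9)])
y = np.array([0, 0, 1, 1, 0, 1, 0, 1, 0, 1])  # 0 = class A, 1 = class B

model = KNeighborsClassifier(n_neighbors=3).fit(X, y)

# Evaluate the model over a grid of points to reveal the decision boundary
xx, yy = np.meshgrid(np.linspace(0, 10, 200), np.linspace(0, 10, 200))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, Z, alpha=0.3)   # shaded class regions
plt.scatter(X[:, 0], X[:, 1], c=y)   # training points
plt.title("k-NN decision boundary (k = 3)")
plt.show()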

Disadvantages

  • The algorithm can be computationally expensive with large datasets.

  • k-NN is sensitive to the choice of k (see the snippet after this list).

  • The algorithm has limited ability to capture complex relationships.

  • The k-NN algorithm might suffer from the curse of dimensionality. This curse refers to the phenomenon where the performance of algorithms such as k-NN degrades as the number of features or dimensions in the dataset increases.

  • Because the entire training dataset must be stored and searched at prediction time, scalability can be an issue with large datasets.
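
To see the sensitivity to k in practice, the following snippet reuses the data list and the k_nearest_neighbors function defined earlier in this blog and queries a point that sits between the two classes:

# Reuses data and k_nearest_neighbors from the implementation above
query = (5, 5)  # equidistant from (5, 6) in class B and (4, 5) in class A
for k in [1, 3, 5]:
    print(k, k_nearest_neighbors(data, query, k))

With this dataset, k = 1 predicts class B (the tie between (5, 6) and (4, 5) is broken by their order in data), while k = 3 and k = 5 predict class A, so the prediction really does depend on the choice of k.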

Conclusion and next steps

This blog has provided a thorough answer to the question "What is the k-NN algorithm?" We covered some use cases, along with the advantages and disadvantages of using the k-NN algorithm for classification. We also demonstrated our example with code, without using any Python libraries other than math.

Don’t stop here! You can explore and practice different techniques and libraries to build more accurate and robust models. We encourage you to check out the following courses on Educative:

A Practical Guide to Machine Learning with Python


This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.

72hrs 30mins
Beginner
108 Playgrounds
12 Quizzes

Machine Learning with Python Libraries


Machine learning is used in software applications to help them generate more accurate predictions. It is a type of artificial intelligence in use worldwide, and it offers high-paying careers. This path provides a hands-on guide to multiple Python libraries that play an important role in machine learning. It also teaches you about neural networks, PyTorch tensors, PyCaret, and GANs. By the end of this module, you'll have hands-on experience using Python libraries to automate your applications.

53hrs
Beginner
56 Challenges
62 Quizzes

Mastering Machine Learning Theory and Practice


The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.

36hrs
Beginner
109 Playgrounds
10 Quizzes

Written By:
Kamran Lodhi