You will probably have heard all the buzz about machine learning and its applications. And if you have, you’ve probably heard about the k-nearest neighbors (k-NN) algorithm. Some of its common use cases include:
Classification problems: The k-NN algorithm assigns a new data point the class that is most common among its nearest neighbors.
Recommender systems: The k-NN algorithm can suggest items by finding users or products most similar to a given one.
Regression analysis: The k-NN algorithm can predict a numeric value by averaging the values of the nearest neighbors.
Healthcare and medicine: The k-NN algorithm can help classify patient data based on similarity to past cases.
In this blog, our focus will be only on classification problems. We’ll take a look at a numerical example running the k-NN algorithm step by step.
Now, let’s explore the basics of the k-NN algorithm.
What is the k-NN algorithm?
Let’s take a look at the steps of the k-NN algorithm.
Step 1: To classify a test instance, define the value of k, i.e., the number of nearest neighbors to consider.
How to define the value of k:
The value of k is usually an odd number to avoid a 50% split in a binary classification problem. A low value of k is more error-prone (it might not give accurate results). In practice, the value of k is usually set to be around sqrt(n), where n is the number of instances in the dataset.
Step 2: Calculate the similarity (often using Euclidean distance or other distance metrics) between the new item and all data points in the training dataset. Based on the similarity values, extract the k nearest neighbors.
Step 3: Assuming multiple classes, count how many of the k nearest neighbors belong to each class.
Step 4: Assign the test instance the most frequent (majority) class among its k nearest neighbors.
Assume that we have the following dataset:
| Data point | Class |
|------------|-------|
| (2, 3) | A |
| (3, 4) | A |
| (5, 6) | B |
| (7, 8) | B |
| (1, 2) | A |
| (6, 7) | B |
| (4, 5) | A |
| (8, 9) | B |
| (2, 2) | A |
| (9, 9) | B |
We also have the following test data point: (6, 5)
Step 1: To classify the test instance (6, 5), define the value of k.
As mentioned above, the value of k is usually set to sqrt(n), where n is the number of training instances. In our example dataset, n is 10, so sqrt(10) ≈ 3.16, and k is set to 3 (the nearest odd number).
Step 2: We calculate the similarity between the new instance and all data points in the training dataset.
Note: We’ll be using the Euclidean distance formula in our calculations.
Recall the formula for Euclidean distance between two points:

d = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Here, (x1, y1) and (x2, y2) are the coordinates of the two points.
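For example, the distance between the training point (2, 3) and our test point (6, 5) works out as follows:

```python
import math

x1, y1 = 2, 3  # training point
x2, y2 = 6, 5  # test point

d = math.sqrt((x2 - x1)**2 + (y2 - y1)**2)
print(round(d, 2))  # sqrt(16 + 4) = sqrt(20) ≈ 4.47
```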
After calculating distances, here is a table of distances from each point to our test point, i.e., (6, 5):
| Data point | Distance from (6, 5) |
|------------|----------------------|
| (2, 3) | 4.47 |
| (3, 4) | 3.16 |
| (5, 6) | 1.41 |
| (7, 8) | 3.16 |
| (1, 2) | 5.83 |
| (6, 7) | 2 |
| (4, 5) | 2 |
| (8, 9) | 4.47 |
| (2, 2) | 5 |
| (9, 9) | 5 |
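To extract the nearest neighbors from these distances, we can sort the (point, distance) pairs in ascending order of distance and slice off the first k. A minimal sketch using the table's values:

```python
import math

test_point = (6, 5)
points = [(2, 3), (3, 4), (5, 6), (7, 8), (1, 2),
          (6, 7), (4, 5), (8, 9), (2, 2), (9, 9)]

# Pair each point with its Euclidean distance to the test point
distances = [(p, math.dist(p, test_point)) for p in points]
distances.sort(key=lambda pair: pair[1])  # ascending by distance

k = 3
nearest = [p for p, _ in distances[:k]]
print(nearest)  # [(5, 6), (6, 7), (4, 5)] — the tie at distance 2 keeps input order
```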
Step 3: Count the number of instances of each class among the k = 3 nearest neighbors. The three closest points are ranked below:
| Data point | Class | Distance from (6, 5) | Rank |
|------------|-------|----------------------|------|
| (2, 3) | A | 4.47 | |
| (3, 4) | A | 3.16 | |
| (5, 6) | B | 1.41 | 1 |
| (7, 8) | B | 3.16 | |
| (1, 2) | A | 5.83 | |
| (6, 7) | B | 2 | 2 |
| (4, 5) | A | 2 | 3 |
| (8, 9) | B | 4.47 | |
| (2, 2) | A | 5 | |
| (9, 9) | B | 5 | |
Step 4: Assign the test instance (6, 5) the class that’s the most frequent or the majority class.
From the above table, we can see that in the neighborhood of the point (6, 5), two of the three closest instances belong to class B, whereas one instance belongs to class A. Therefore, the test instance (6, 5) is assigned class B.
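The majority vote over the three neighbors can also be expressed compactly with the standard library's collections.Counter:

```python
from collections import Counter

# Classes of the 3 nearest neighbors of (6, 5), taken from the table above:
# (5, 6) -> B, (6, 7) -> B, (4, 5) -> A
neighbor_classes = ['B', 'B', 'A']

# most_common(1) returns the (label, count) pair with the highest count
majority_class = Counter(neighbor_classes).most_common(1)[0][0]
print(majority_class)  # B
```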
The following code implements the k-NN algorithm from scratch. No external libraries are needed (only Python’s built-in math library) in this implementation example.

```python
import math

# Sample dataset
data = [(2, 3, 'A'), (3, 4, 'A'), (5, 6, 'B'), (7, 8, 'B'), (1, 2, 'A'),
        (6, 7, 'B'), (4, 5, 'A'), (8, 9, 'B'), (2, 2, 'A'), (9, 9, 'B')]

# Function to calculate Euclidean distance between two points
def euclidean_distance(point1, point2):
    distance = 0
    distance = math.sqrt((point1[0] - point2[0])**2 + (point1[1] - point2[1])**2)
    return distance

# k-NN algorithm
def k_nearest_neighbors(data, query_point, k):
    distances = []

    # Calculate distances from the query point to all data points
    for data_point in data:
        distance = euclidean_distance(query_point, data_point)
        distances.append((data_point, distance))

    # Sort distances in ascending order
    distances.sort(key=lambda x: x[1])

    # Get the k-nearest neighbors
    neighbors = [item[0] for item in distances[:k]]

    # Count the occurrences of each class among the neighbors
    class_counts = {}
    for neighbor in neighbors:
        label = neighbor[2]
        if label in class_counts:
            class_counts[label] += 1
        else:
            class_counts[label] = 1

    # Determine the majority class
    sorted_class_counts = sorted(class_counts.items(), key=lambda x: x[1], reverse=True)
    return sorted_class_counts[0][0]

# Test the k-NN algorithm
query = (6, 5)
k = 3
result = k_nearest_neighbors(data, query, k)
print(f"The query point {query} belongs to class: {result}")
```
The code is explained below:
Line 1: We start by importing the math library, which is later used to calculate square roots while computing Euclidean distances between two points.
Lines 7–11: We define the euclidean_distance function, which calculates the Euclidean distance between two points.
Line 14: We start defining the k_nearest_neighbors algorithm, which takes three arguments: data (the dataset), query_point (the point we want to classify), and k (the number of neighbors).
Line 15: We initialize an empty list called distances to store distances between query_point and the data points.
Lines 18–20: We initialize a loop to iterate through each data point in the dataset. We then call the euclidean_distance function, which calculates the distance between query_point and the current data_point, and store it in the distance variable. Finally, we append a tuple containing the data point and its distance to the distances list.
Line 23: We then sort the distances list in ascending order based on the distances. This step identifies the nearest neighbors.
Line 26: Next, we select the first k data points from the sorted distances list and store them in the neighbors list.
Lines 29–35: We start by initializing an empty dictionary called class_counts to count the occurrences of each class among the neighbors. Then, we initiate a loop to iterate through each neighbor in the neighbors list, storing its class label in the label variable. We then have an if-else condition: we first check if the class label already exists in the dictionary. If it does, we increment its count by 1. Otherwise, we add it to our dictionary with a count of 1.
Lines 38–39: We now need to determine the majority class. For this, we sort the class_counts dictionary items in descending order of count, then return the class label with the highest count (the majority class).
Lines 42–45: Now, we test our algorithm. We set query_point to (6, 5) and the value of k to 3. We then call the k_nearest_neighbors function and print the result.
Let’s now take a look at a few advantages and disadvantages of using the k-NN algorithm.
Advantages
The algorithm is simple to understand and implement.
Unlike other classification algorithms, there’s no training involved. Learning is instance-based.
The algorithm can easily adapt to new data because computation is deferred until prediction time (this is known as lazy learning).
No assumptions are made about data distribution.
The predictions made by the k-NN algorithm are easy to interpret because they are based on concrete neighboring examples.
Disadvantages
The algorithm can be computationally expensive with large datasets.
The algorithm has limited ability to capture complex relationships.
The algorithm is sensitive to the choice of k and the distance metric, as well as to noisy and irrelevant features.
Scalability can be an issue with large datasets.
This blog has provided a thorough answer to what the k-NN algorithm is. We mentioned some use cases, along with advantages and disadvantages of using the k-NN algorithm for classification. We also demonstrated our example with code without using any Python libraries.
Don’t stop here! You can explore and practice different techniques and libraries to build more accurate and robust models. We encourage you to check out the following courses on Educative:
A Practical Guide to Machine Learning with Python
This course teaches you how to code basic machine learning models. The content is designed for beginners with general knowledge of machine learning, including common algorithms such as linear regression, logistic regression, SVM, KNN, decision trees, and more. If you need a refresher, we have summarized key concepts from machine learning, and there are overviews of specific algorithms dispersed throughout the course.
Machine Learning with Python Libraries
Machine learning is used in software applications to help them generate more accurate predictions. It is a widely used form of artificial intelligence that offers high-paying careers. This path will provide a hands-on guide on multiple Python libraries that play an important role in machine learning. This path also teaches you about neural networks, PyTorch Tensor, PyCaret, and GAN. By the end of this module, you’ll have hands-on experience in using Python libraries to automate your applications.
Mastering Machine Learning Theory and Practice
The machine learning field is rapidly advancing today due to the availability of large datasets and the ability to process big data efficiently. Moreover, several new techniques have produced groundbreaking results for standard machine learning problems. This course provides a detailed description of different machine learning algorithms and techniques, including regression, deep learning, reinforcement learning, Bayes nets, support vector machines (SVMs), and decision trees. The course also offers sufficient mathematical details for a deeper understanding of how different techniques work. An overview of the Python programming language and the fundamental theoretical aspects of ML, including probability theory and optimization, is also included. The course contains several practical coding exercises as well. By the end of the course, you will have a deep understanding of different machine-learning methods and the ability to choose the right method for different applications.
Free Resources