Nearest Neighbors
An introduction to the Nearest Neighbors algorithm for unsupervised learning
Unsupervised learning
Unsupervised learning is a special kind of ML problem where the target is not given, i.e., the data is not labeled. In the classification and regression techniques we have seen, a target is available (continuous, categorical, or binary), and we build a model to predict that target correctly. However, in some problems no target is given. For example, consider grouping people of similar interests based on age, gender, date of birth, education, and occupation. Here, there is no defined target, like doctor or not doctor; we have to group these people based only on the given features. This kind of problem, where a labeled target is not available, is known as an unsupervised learning task.
Some of the applications of unsupervised learning include:
- Grouping news articles based on text
- Grouping images without the label
- Showing related items on an e-commerce website
- Recommending music or movies to a user
Nearest neighbor search
Nearest neighbor search is the process of finding the data point closest to a given point, known as its nearest neighbor. When we want to understand the behavior of a data point, we can study other data points with similar properties. Consider a person with some attributes (like age, gender, and education). If we want to know the behavior of a 25-year-old male with a master's degree, we can identify other people who share these attributes and gain valuable insight into that person's likely choices.
Another example is grouping blogs or articles on the web. Suppose you run a blogging website where new blogs are published every day. You want to find which previous blog (or blogs) is most similar to a new publication, so you can recommend the new page to readers of that previous blog. Finding that nearest point is known as the nearest neighbor search.
Nearest neighbor algorithm
Input: a set of data points (DPS) and the point of study (P).
Output: the closest data point from the set.
Algorithm:
```python
def nearest_neighbor(dps, p, distance):
    # distance is any distance function, e.g., the Euclidean distance defined below.
    min_distance = float("inf")
    closest_point = None
    for dp in dps:
        d = distance(dp, p)
        if d < min_distance:
            min_distance, closest_point = d, dp
    return closest_point
```
This algorithm gives us the closest point to the point of interest.
We can modify the above algorithm to report the K closest data points instead of just one, as sketched below.
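A minimal sketch of this K-nearest variant (the k_nearest_neighbors helper and the example points below are illustrative, not part of the lesson):

```python
import heapq

def k_nearest_neighbors(dps, p, distance, k):
    # Return the k points in dps with the smallest distance to p, nearest first.
    return heapq.nsmallest(k, dps, key=lambda dp: distance(dp, p))

# Usage with a simple absolute-difference distance on scalars:
points = [4.0, 1.5, 9.0, 2.0, 7.5]
print(k_nearest_neighbors(points, 3.0, lambda a, b: abs(a - b), k=2))  # [4.0, 2.0]
```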
This algorithm has two major elements:
- How we represent a data point.
- How we measure the distance between two data points.
Data representation:
Our goal is to feed the data to an algorithm so that a similarity calculation can be applied and the distance can be expressed as a number. To do this, we can convert each data point into a collection of numerical features.
For example, let's look at the text of a blog. We can create a vocabulary of 100, 1,000, or 10,000 words and count how often each word occurs in the blog. Consider a small example:
Vocab: {This, is, Cricket, Player, Basketball, No, Football, Call, He, She, Likes, Loves, Hates, Soccer, and, in, for, good, bad}
- S1: He likes cricket and football.
- S2: She hates Football. Football is good.
Our vocabulary contains 19 features, so we can transform each of these statements into a 19-dimensional count vector, as the sketch below shows.
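A short sketch of this transformation (the count_vector helper below is an illustrative name, and the tokenization is deliberately simple):

```python
vocab = ["this", "is", "cricket", "player", "basketball", "no", "football",
         "call", "he", "she", "likes", "loves", "hates", "soccer", "and",
         "in", "for", "good", "bad"]

def count_vector(text, vocab):
    # Lowercase the text, drop periods, and count each vocabulary word.
    words = text.lower().replace(".", "").split()
    return [words.count(word) for word in vocab]

print(count_vector("He likes cricket and football.", vocab))
print(count_vector("She hates Football. Football is good.", vocab))  # "football" counts as 2
```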
This is just one way to represent text. Another common and more effective approach is TF-IDF (Term Frequency-Inverse Document Frequency).
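As a sketch of that approach, assuming scikit-learn is available, its TfidfVectorizer can compute TF-IDF vectors for the two statements directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["He likes cricket and football.",
        "She hates Football. Football is good."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one sparse row per document

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())
```

Note that TfidfVectorizer builds its own vocabulary from the documents rather than using the fixed 19-word list above.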
Quiz: Word count representation
What is the problem with using raw word counts?
- Common words (e.g., the, an, is) can dominate the others.
- Rare but meaningful keywords (like Cricket or Basketball) have little impact.
- Both of the above.
Distance
We can use different distance measures to calculate similarity. One common example is the Euclidean distance.
If we have N features, then the Euclidean distance between two points x and y is:

$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$
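A minimal implementation of this formula, assuming both points are equal-length sequences of numbers:

```python
import math

def euclidean_distance(x, y):
    # Square root of the summed squared differences across all N features.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```

This function can be passed directly as the distance argument of nearest_neighbor above.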