...


Nearest Neighbors

Introduction to the Nearest Neighbors algorithm for unsupervised learning

Unsupervised learning

Unsupervised learning is a kind of ML problem where the target is not given, i.e., the data is not labeled. In classification and regression, a target is available (continuous, categorical, or binary), and we build a model to predict that target correctly. In some problems, however, no target is given. For example, consider grouping people of similar interests based on age, gender, date of birth, education, and occupation. Here there is no defined target, like doctor or not doctor; we have to group these people based only on the given features. This kind of problem, where a labeled target is not available, is known as an unsupervised learning task.

Some of the applications of unsupervised learning include:

  • Grouping news articles based on text
  • Grouping images without the label
  • Showing related items on an e-commerce website
  • Recommending music or movies to a user

Nearest neighbor search

Nearest neighbor search is the process of finding the closest data point to a query point; that closest point is known as the nearest neighbor. When we want to understand the behavior of a data point, we can learn from other data points with similar properties. Consider an example of a person with attributes like age, gender, and education: if we want to know the behavior of a 25-year-old male with a master's degree, we can identify other people who share these attributes and gain valuable insight into that person's likely choices.

Another example is grouping blogs or articles on the web. Suppose you run a blogging website where new blogs are published every day. You want to find which previous blog (or blogs) is most similar to a new publication, so you can recommend the new page to readers of that previous blog. Finding that closest point is known as nearest neighbor search.

Nearest neighbor algorithm

Input: A set of data points (DPS) and the point of study (P).

Output: The closest data point from the set.

Algorithm (as runnable Python):

import math

def nearest_neighbor(DPS, P, distance):
    min_distance = math.inf      # best distance seen so far
    closest_point = None
    for DP in DPS:               # scan every candidate point
        d = distance(DP, P)      # compute the distance once per point
        if d < min_distance:     # found a closer point; remember it
            min_distance = d
            closest_point = DP
    return closest_point

This algorithm gives us the closest data point to the point of interest.

We can modify the above algorithm to report the K closest data points instead of just one, as sketched below.
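A minimal sketch of that K-point modification, reusing the nearest_neighbor setup above (the function name k_nearest_neighbors is illustrative, not part of the original lesson):

import heapq

def k_nearest_neighbors(DPS, P, K, distance):
    # Rank every candidate by its distance to P and keep the K smallest.
    return heapq.nsmallest(K, DPS, key=lambda DP: distance(DP, P))

For a small data set, sorting all points by distance works just as well; heapq.nsmallest simply avoids a full sort when K is much smaller than the number of points.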

This algorithm has two major elements:

  1. How we represent a data point.
  2. How we measure the distance between two data points.

Data representation

Our goal is to feed the data to an algorithm that can apply a similarity calculation and produce a numerical distance. To do that, we convert each data point into a collection of numerical features.

For example, let’s look at the text of a blog. We can create a vocabulary of 100, 1,000, or 10,000 words and count how often each word occurs in the blog. Consider a small example:

Vocab: {This, is, Cricket, Player, Basketball, No, Football, Call, He, She, Likes, Loves, Hates, Soccer, and, in, for, good, bad}

  • S1: He likes cricket and football.
  • S2: She hates Football. Football is good.

Our vocabulary contains 19 features, so we can transform each statement into a 19-dimensional vector of word counts:
  • S1 → [0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0]
  • S2 → [0, 1, 0, 0, 0, 0, 2, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0]

This is just one way to represent text. Another good way is TF-IDF (Term Frequency–Inverse Document Frequency), which is more commonly used and more effective.
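As a minimal sketch (assuming scikit-learn is installed), both representations can be computed with CountVectorizer and TfidfVectorizer; the variable names here are illustrative:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

sentences = [
    "He likes cricket and football.",         # S1
    "She hates Football. Football is good.",  # S2
]

# Bag-of-words: raw counts per vocabulary word
count_vec = CountVectorizer()
counts = count_vec.fit_transform(sentences)
print(count_vec.get_feature_names_out())  # needs a recent scikit-learn
print(counts.toarray())

# TF-IDF: down-weights words that appear in many documents
tfidf_vec = TfidfVectorizer()
print(tfidf_vec.fit_transform(sentences).toarray())

Note that CountVectorizer lowercases the text and learns its own vocabulary from the data, so its feature order will differ from the hand-built vocabulary above.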

Quiz: Word count representation

Q: What is the problem with using word counts?

A) Common words (e.g., the, an, is) can dominate others.

B) Rare keywords (like Cricket, Basketball) have little impact.

C) Both of the above.

Distance

We can use different distance measures to calculate similarity. One example is Euclidean distance. For a single feature:

$Distance(x_i, x_p) = |x_i - x_p|$
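A minimal NumPy sketch of this distance (the helper name euclidean is an assumption); it handles single numbers as well as the feature vectors discussed next, and can be passed as the distance argument to nearest_neighbor above:

import numpy as np

def euclidean(DP, P):
    # Works for single numbers and for equal-length feature vectors.
    DP, P = np.asarray(DP, dtype=float), np.asarray(P, dtype=float)
    return float(np.sqrt(np.sum((DP - P) ** 2)))

print(euclidean(25, 28))                # |25 - 28| = 3.0
print(euclidean([1, 0, 1], [0, 0, 1]))  # 1.0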

If we have N features, then:

...