Nearest Neighbors
An introduction to the Nearest Neighbors algorithm for unsupervised learning
Unsupervised learning
Unsupervised learning is a special kind of ML problem where the target is not given, i.e., the data is not labeled. In the classification and regression techniques we have seen, a target is available (continuous, categorical, or binary), and we build a model to predict that target correctly. However, in some problems no target is given. For example, consider grouping people of similar interests based on age, gender, date of birth, education, and occupation. Here, there is no defined target, like doctor or not doctor; we have to group these people based only on the given features. This kind of problem, where a labeled target is not available, is known as an unsupervised learning task.
Some of the applications of unsupervised learning include:
- Grouping news articles based on text
- Grouping images without the label
- Showing related items on an e-commerce website
- Recommending music or movies to a user
Nearest neighbor search
Nearest neighbor search is the process of finding the data point closest to a given point, known as its nearest neighbor. When we want to understand the behavior of a data point, we can study other data points with similar properties. Consider a person with some attributes (like age, gender, and education). If we want to know the behavior of a 25-year-old male with a master's degree, we can identify other people who share these attributes and gain valuable insight into that person's likely choices.
Another example is grouping blogs or articles on the web. Suppose you run a blogging website where new blogs are published every day. You want to find which previous blog (or blogs) is most similar to a new publication, so you can recommend the new page to readers of that previous blog. Finding that nearest point is known as the nearest neighbor search.
Nearest neighbor algorithm
Input: a set of data points (DPS) and the point of study (P).
Output: the closest data point from the set.
Algorithm:
```python
def nearest_neighbor(dps, p, distance):
    # distance is any distance function, e.g., the Euclidean distance defined below.
    min_distance = float("inf")
    closest_point = None
    for dp in dps:
        d = distance(dp, p)
        if d < min_distance:
            min_distance, closest_point = d, dp
    return closest_point
```
This algorithm gives us the closest point to the point of interest.
We can modify the above algorithm to report the K closest data points instead of just one, as sketched below.
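A minimal sketch of this K-nearest variant (the k_nearest_neighbors helper and the example points below are illustrative, not part of the lesson):

```python
import heapq

def k_nearest_neighbors(dps, p, distance, k):
    # Return the k points in dps with the smallest distance to p, nearest first.
    return heapq.nsmallest(k, dps, key=lambda dp: distance(dp, p))

# Usage with a simple absolute-difference distance on scalars:
points = [4.0, 1.5, 9.0, 2.0, 7.5]
print(k_nearest_neighbors(points, 3.0, lambda a, b: abs(a - b), k=2))  # [4.0, 2.0]
```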
This algorithm has two major elements:
- How we represent a data point.
- How we measure the distance between two data points.
Data representation:
Our goal is to feed the data to an algorithm so that a similarity calculation can be applied and the distance can be expressed as a number. To do this, we can convert each data point into a collection of numerical features.
For example, let's look at the text of a blog. We can create a vocabulary of 100, 1,000, or 10,000 words and count how often each word occurs in the blog. Consider a small example:
Vocab: {This, is, Cricket, Player, Basketball, No, Football, Call, He, She, Likes, Loves, Hates, Soccer, and, in, for, good, bad}
- S1: He likes cricket and football.
- S2: She hates Football. Football is good.
Our vocabulary contains 19 features, so we can transform each of these statements into a 19-dimensional count vector, as the sketch below shows.
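A short sketch of this transformation (the count_vector helper below is an illustrative name, and the tokenization is deliberately simple):

```python
vocab = ["this", "is", "cricket", "player", "basketball", "no", "football",
         "call", "he", "she", "likes", "loves", "hates", "soccer", "and",
         "in", "for", "good", "bad"]

def count_vector(text, vocab):
    # Lowercase the text, drop periods, and count each vocabulary word.
    words = text.lower().replace(".", "").split()
    return [words.count(word) for word in vocab]

print(count_vector("He likes cricket and football.", vocab))
print(count_vector("She hates Football. Football is good.", vocab))  # "football" counts as 2
```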
This is just one way to represent text. Another common and more effective approach is TF-IDF (Term Frequency-Inverse Document Frequency).
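As a sketch of that approach, assuming scikit-learn is available, its TfidfVectorizer can compute TF-IDF vectors for the two statements directly:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["He likes cricket and football.",
        "She hates Football. Football is good."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)   # one sparse row per document

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())
```

Note that TfidfVectorizer builds its own vocabulary from the documents rather than using the fixed 19-word list above.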
Quiz: Word count representation
What is the problem with using raw word counts?
- Common words (e.g., the, an, is) can dominate the others.
- Rare but meaningful keywords (like Cricket or Basketball) have little impact.
- Both of the above.
Distance
We can use different distance measures to calculate similarity. One common example is the Euclidean distance.
If we have N features, then the Euclidean distance between two points x and y is:

$$d(x, y) = \sqrt{\sum_{i=1}^{N} (x_i - y_i)^2}$$
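A minimal implementation of this formula, assuming both points are equal-length sequences of numbers:

```python
import math

def euclidean_distance(x, y):
    # Square root of the summed squared differences across all N features.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([1, 2, 3], [4, 6, 3]))  # 5.0
```

This function can be passed directly as the distance argument of nearest_neighbor above.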