TF-IDF stands for “Term Frequency – Inverse Document Frequency.” It reflects how important a word is to a document in a collection or corpus, and it is often used in information retrieval and text mining as a weighting factor.
TF-IDF is composed of two terms:

- Term Frequency (TF): how often a word appears in a document, usually normalized by the document’s length.
- Inverse Document Frequency (IDF): how rare the word is across the corpus, typically the logarithm of the total number of documents divided by the number of documents that contain the word.
So, essentially, the TF-IDF value increases as the word’s frequency in a document (TF) increases. However, this is offset by how many documents in the corpus contain the word (IDF).
We have IDF to down-weight common words like “the” or “is” that would otherwise have a high term frequency but are not that important.
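To make that concrete, here is a bare-bones sketch of the computation in Python. The `tf_idf` helper and the tiny corpus below are made up for illustration; real implementations add smoothing and normalization on top of this basic scheme.

```python
import math

def tf_idf(word, doc, corpus):
    """Toy TF-IDF: plain count ratio for TF, log of a document ratio for IDF."""
    # Term frequency: share of the document's words that are `word`.
    tf = doc.count(word) / len(doc)
    # Inverse document frequency: the fewer documents contain the word,
    # the larger this factor gets.
    docs_with_word = sum(1 for d in corpus if word in d)
    idf = math.log(len(corpus) / docs_with_word)
    return tf * idf

corpus = [
    "the sky is blue".split(),
    "the sun is bright".split(),
]

print(tf_idf("the", corpus[0], corpus))   # 0.0   -- "the" is in every document
print(tf_idf("blue", corpus[0], corpus))  # ~0.17 -- "blue" is unique to one document
```

Because “the” appears in every document, its IDF (and therefore its TF-IDF) is zero, while a word unique to a single document gets a positive weight.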
Let’s look at an example of how TF-IDF works.
Consider two sentences (or documents):

1. The cat is white.
2. The cat is black.
Notice that the only difference between the two sentences is the words “white” and “black”. These are important words that should get a high TF-IDF value, while words like “the” and “cat” should get a low value.
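As a quick check, here is a sketch using scikit-learn’s `TfidfVectorizer` on the two sentences above (assuming scikit-learn is installed). Its default scheme adds smoothing and L2 normalization, so the exact numbers differ from a hand calculation, but the ranking is what matters.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The cat is white.", "The cat is black."]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # rows = documents, columns = words

# TF-IDF weights of each word in the first document, highest first.
weights = {word: tfidf[0, col] for word, col in vectorizer.vocabulary_.items()}
for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word:>6}: {weight:.3f}")
```

With only two documents the contrast is modest, but the pattern matches the intuition above: the word unique to the first document (“white”) ranks highest, the word that only occurs in the other document (“black”) gets zero, and the words shared by both documents sit in between.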