What is TF-IDF?

TF-IDF stands for “Term Frequency – Inverse Document Frequency.” It reflects how important a word is to a document in a collection or corpus. This technique is often used in information retrieval and text mining as a weighting factor.

TF-IDF is composed of two terms:

  • Term Frequency (TF):
    The number of times a word appears in a document divided by the total number of words in that document.

  • Inverse Document Frequency (IDF):
    The logarithm of the number of documents in the corpus divided by the number of documents in which the specific term appears.
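
One common way to write these two terms, and the score they produce together, is the following (the exact IDF variant, such as the log base or smoothing, differs between implementations; here t is a term, d is a document, and N is the number of documents in the corpus):

  TF(t, d) = (number of times t appears in d) / (total number of words in d)
  IDF(t) = log(N / number of documents that contain t)
  TF-IDF(t, d) = TF(t, d) × IDF(t)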

So, essentially, the TF-IDF value increases as the word’s frequency in a document (TF) increases. However, this is offset by how many documents in the corpus contain the word (IDF): the more documents a word appears in, the smaller its IDF becomes.

The IDF term helps down-weight common words like “the” or “is” that would otherwise have a high term frequency but carry little meaning.
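
To make this concrete, here is a minimal from-scratch sketch in Python, assuming the plain log(N / document count) IDF written above and a tiny made-up corpus; library implementations such as scikit-learn’s TfidfVectorizer use smoothed variants, so their exact numbers will differ.

import math

def tf(term, document):
    # Term frequency: share of the document's words that are this term.
    words = document.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    # Inverse document frequency: log(total documents / documents containing the term).
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)

def tf_idf(term, document, corpus):
    return tf(term, document) * idf(term, corpus)

# Hypothetical two-document corpus, just for illustration.
corpus = ["the sky is blue", "the sun is bright today"]

print(tf_idf("blue", corpus[0], corpus))  # ~0.17: rare across the corpus, so it keeps its weight
print(tf_idf("the", corpus[0], corpus))   # 0.0: appears in every document, so IDF cancels it out

Notice that “the” scores exactly zero here because log(2 / 2) = 0, which is why many implementations add smoothing so that such words keep a small non-zero weight.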

Example

Let’s look at an example of how TF-IDF works.

Consider two sentences (or documents):

  1. “The cat is white”
  2. “The cat is black”

Notice that the only difference between the two sentences is the words “white” and “black”. These are important words that should get a high TF-IDF value, while words like “the” and “cat” should get a low value.

[Figure: TF-IDF value for the word "white"]
[Figure: TF-IDF value for the word "the"]
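
Working through the numbers for the first sentence, “The cat is white”, and assuming the natural logarithm (the choice of base only rescales the scores):

  TF("white") = 1 / 4 = 0.25, IDF("white") = log(2 / 1) ≈ 0.693, so TF-IDF ≈ 0.25 × 0.693 ≈ 0.17
  TF("the") = 1 / 4 = 0.25, IDF("the") = log(2 / 2) = 0, so TF-IDF = 0

As expected, “white” gets a noticeably higher weight than “the”, which scores zero because it appears in every document in this tiny corpus.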