Searching for Matching Documents with tf-idf
Learn about tf-idf and how it is calculated and used in information retrieval, search engines, and other NLP applications.
Playing a game with documents
There is a common children’s game called “I Spy.” A group sits in a circle, and the leader says, “I spy, with my little eye, something blue.” Everyone else then tries to guess what the leader is looking at. Was it the blue telephone? Or perhaps the blue couch?
Natural language processing is often similar to this game. Given a document or a word, we have to determine the best-matching document from a list of documents. This is exactly what happens in an internet search or in spam filtering.
There are many strategies for this type of search. One of the most common is called term frequency-inverse document frequency or tf-idf.
Note: TF-IDF, TF*IDF, TFIDF, tf-idf, and Tf-idf all refer to the same concept: term frequency-inverse document frequency. We can use any of these forms interchangeably.
Understanding tf-idf
Tf-idf is a measure of the importance of a word in a document relative to its frequency in a corpus of documents.
The “tf” is the term frequency: the number of times a word appears in a document divided by the total number of words in that document. Note that this relative frequency is different from the raw count of the word in the document, which is also sometimes referred to as term frequency.
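The relative term frequency described above can be sketched in a few lines of Python. This is a minimal illustration with a simple whitespace tokenizer; real systems normalize punctuation and casing more carefully, and the function name here is just for this example.

```python
def term_frequency(word, document):
    """Relative term frequency: occurrences of `word` divided by total words."""
    words = document.lower().split()
    return words.count(word.lower()) / len(words)

# "blue" appears 2 times among 8 words -> tf = 0.25
tf = term_frequency("blue", "I spy something blue and the blue couch")
print(tf)  # 0.25
```

Dividing by the document length keeps long documents from dominating simply because they contain more words overall.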
The “idf” is the inverse document frequency, which measures how rare a word is across all documents in the corpus. It is typically calculated as the logarithm of the total number of documents in the corpus divided by the number of documents containing the word (or term).
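Putting the two parts together, a small sketch of idf and the combined tf-idf score might look like the following. This uses the plain log(N / df) variant; libraries often add smoothing terms (for example, scikit-learn adds 1 to the counts) to avoid division by zero for unseen words, and the function names here are illustrative.

```python
import math

def term_frequency(word, document):
    words = document.lower().split()
    return words.count(word.lower()) / len(words)

def inverse_document_frequency(word, corpus):
    """log(total documents / documents containing the word)."""
    containing = sum(1 for doc in corpus if word.lower() in doc.lower().split())
    return math.log(len(corpus) / containing)

def tf_idf(word, document, corpus):
    return term_frequency(word, document) * inverse_document_frequency(word, corpus)

corpus = [
    "the blue telephone rang",
    "the blue couch is soft",
    "the game is fun",
]

# "blue" appears in 2 of 3 documents -> idf = log(3/2)
print(inverse_document_frequency("blue", corpus))
# "the" appears in every document -> idf = log(3/3) = 0, so its tf-idf is 0
print(tf_idf("the", corpus[0], corpus))
```

Notice how a word that appears in every document gets an idf of zero: it carries no information for distinguishing documents, which is exactly the intuition behind tf-idf.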