What is the Jaccard Similarity measure in NLP?

Overview

Document (or text) similarity estimates how similar the given documents are to each other. There are several ways to measure it, such as Cosine Similarity and Euclidean Distance.

Jaccard Similarity is one of the ways to determine the similarity between the documents.

Jaccard Similarity is defined as the ratio of the size of the intersection of the documents' token sets to the size of their union. In other words, it is the number of unique tokens common to the documents divided by the total number of unique tokens across the documents.

Taking tokens to be the words in a document, Jaccard Similarity is the ratio of the number of distinct words shared by the documents to the total number of distinct words in them.

The value of Jaccard Similarity ranges from 0 to 1, where 1 indicates that the documents contain exactly the same set of tokens, while 0 means that they have no tokens in common.
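To make the two endpoints concrete, here is a minimal sketch. The helper name `jaccard` and the sample sentences are illustrative assumptions, not part of the original example, and the helper assumes that at least one input is non-empty.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity of two token collections (hypothetical helper).

    Assumes at least one collection is non-empty, so the union is not empty.
    """
    set_a, set_b = set(tokens_a), set(tokens_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Same set of words, different order and repetition
same_words = jaccard("the cat sat".split(), "sat the cat the".split())

# No word in common
no_overlap = jaccard("red green".split(), "blue yellow".split())

print(same_words)  # 1.0
print(no_overlap)  # 0.0
```

Note that a score of 1 only says the two word sets match. Word order and repeated words do not affect the result.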

The mathematical representation of the similarity for two documents with token sets A and B is as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

Example

Consider the following example:

doc_1 = “educative is the best platform out there.”

doc_2 = “educative is a new platform.”

Tokenizing the documents above into words (ignoring punctuation), we get the following:

  • words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}

  • words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
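One possible way to produce these word sets is sketched below. The `tokenize` helper and its regular expression are assumptions made for illustration; the original does not prescribe a particular tokenizer.

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of unique words (illustrative helper).

    The pattern keeps runs of letters, digits, and apostrophes, so trailing
    punctuation such as '.' is dropped.
    """
    return set(re.findall(r"[a-z0-9']+", text.lower()))


words_doc_1 = tokenize("educative is the best platform out there.")
words_doc_2 = tokenize("educative is a new platform.")

print(sorted(words_doc_1))
# ['best', 'educative', 'is', 'out', 'platform', 'the', 'there']
print(sorted(words_doc_2))
# ['a', 'educative', 'is', 'new', 'platform']
```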

The intersection, that is, the set of words common to both documents, is {'educative', 'is', 'platform'}. 3 words are common.

The union, that is, the set of all distinct words in the documents, is {'educative', 'is', 'the', 'best', 'platform', 'out', 'there', 'a', 'new'}. In total, there are 9 words.

Hence, the Jaccard similarity is 3/9 ≈ 0.333.
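The arithmetic above can be checked with Python's built-in set operators. This is a small sketch; the variable names `common`, `combined`, and `score` are chosen here for illustration.

```python
from fractions import Fraction

words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}

common = words_doc_1 & words_doc_2    # intersection
combined = words_doc_1 | words_doc_2  # union

# Fraction keeps the exact ratio before converting to a decimal
score = Fraction(len(common), len(combined))

print(sorted(common))                 # ['educative', 'is', 'platform']
print(len(combined))                  # 9
print(score, round(float(score), 3))  # 1/3 0.333
```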

Code

def intersection(doc_1, doc_2):
    return doc_1.intersection(doc_2)

def union(doc_1, doc_2):
    return doc_1.union(doc_2)

def jaccard_similarity(doc_1, doc_2):
    # Lowercase each document and split it into words
    words_doc_1 = doc_1.lower().split(' ')
    words_doc_2 = doc_2.lower().split(' ')
    # Remove duplicate words by converting the lists to sets
    words_doc_1_set = set(words_doc_1)
    words_doc_2_set = set(words_doc_2)
    # Words common to both documents
    intersection_docs = intersection(words_doc_1_set, words_doc_2_set)
    # All distinct words across both documents
    union_docs = union(words_doc_1_set, words_doc_2_set)
    # Ratio of shared words to all distinct words
    return len(intersection_docs) / len(union_docs)

doc_1 = "educative is the best platform out there"
doc_2 = "educative is a new platform"
print("doc_1 - '%s'" % (doc_1, ))
print("doc_2 - '%s'" % (doc_2, ))
print("Jaccard_similarity(doc_1, doc_2) = %s" % (jaccard_similarity(doc_1, doc_2)))

Explanation

  • Lines 1-2: The intersection function returns the intersection of two sets of words.
  • Lines 4-5: The union function returns the union of two sets of words.
  • Lines 9-10: Each document is converted to lowercase and split on the space character to get the words in the document.
  • Lines 12-13: Each list of words is converted into a set, which removes duplicates.
  • Line 15: We get the intersection of the word sets of doc_1 and doc_2 using the intersection function in lines 1-2.
  • Line 17: We get the union of the word sets of doc_1 and doc_2 using the union function in lines 4-5.
  • Line 19: The number of words in the intersection divided by the number of words in the union is returned.
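Because the code splits only on spaces, punctuation stays attached to words. The sketch below compares space splitting with a regex-based tokenizer on the punctuated sentences from the example. The `jaccard` helper and the regular expression are illustrative assumptions rather than part of the original code.

```python
import re


def jaccard(set_a, set_b):
    """Jaccard similarity of two sets (hypothetical helper, assumes a non-empty union)."""
    return len(set_a & set_b) / len(set_a | set_b)


doc_1 = "educative is the best platform out there."
doc_2 = "educative is a new platform."

# Splitting on spaces leaves 'there.' and 'platform.' as tokens,
# so 'platform' and 'platform.' no longer match.
space_score = jaccard(set(doc_1.lower().split(' ')),
                      set(doc_2.lower().split(' ')))

# Keeping only runs of letters, digits, and apostrophes drops the punctuation.
pattern = r"[a-z0-9']+"
regex_score = jaccard(set(re.findall(pattern, doc_1.lower())),
                      set(re.findall(pattern, doc_2.lower())))

print(space_score)            # 0.2
print(round(regex_score, 3))  # 0.333
```

With space splitting, only 2 of 10 distinct tokens are shared. Once punctuation is stripped, the result matches the 3/9 from the worked example.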
