Document (or text) similarity is a measure of how similar the given documents are to each other. There are different ways of measuring it, such as Cosine Similarity and Euclidean Distance.
Jaccard Similarity is one of the ways to determine the similarity between the documents.
Jaccard Similarity is defined as the ratio of the size of the intersection of the documents to the size of their union. In other words, it is the number of distinct tokens common to the documents divided by the number of distinct tokens that appear in any of them.
Considering tokens as words in the document, Jaccard Similarity is the number of distinct words shared by the documents divided by the total number of distinct words across the documents.
The value of Jaccard Similarity ranges from 0 to 1, where 1 indicates the documents contain exactly the same set of tokens, while 0 means the documents have no tokens in common.
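The two extremes can be checked with a minimal sketch, assuming the documents have already been tokenized into Python sets. The helper name `jaccard` and the sample color sets are chosen here for illustration only.

```python
def jaccard(a, b):
    """Ratio of shared elements to all distinct elements of two sets."""
    return len(a & b) / len(a | b)

# Same tokens in both sets: the ratio is 1.
identical = jaccard({"red", "green"}, {"red", "green"})

# No tokens in common: the ratio is 0.
disjoint = jaccard({"red", "green"}, {"blue"})

# One shared token out of three distinct tokens: somewhere in between.
partial = jaccard({"red", "green"}, {"green", "blue"})

print(identical)  # 1.0
print(disjoint)   # 0.0
print(partial)
```

Every other pair of non-empty sets falls strictly between these two values, because the intersection can never be larger than the union.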
The mathematical representation of the similarity, for two documents treated as token sets $A$ and $B$, is as follows:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
Consider the following example:
doc_1 = "educative is the best platform out there."
doc_2 = "educative is a new platform."
Tokenizing the documents above into words (ignoring punctuation), we get the following:
words_doc_1 = {'educative', 'is', 'the', 'best', 'platform', 'out', 'there'}
words_doc_2 = {'educative', 'is', 'a', 'new', 'platform'}
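One possible way to produce these word sets while dropping punctuation is sketched below. The `tokenize` helper and the regular expression that defines a "word" are assumptions made for this illustration, not part of the original code.

```python
import re

def tokenize(text):
    """Lowercase the text and return its distinct words, dropping punctuation."""
    # Assumption: a word is a run of lowercase letters, digits, or apostrophes.
    # Other languages or token definitions would need a different pattern.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

words_doc_1 = tokenize("educative is the best platform out there.")
words_doc_2 = tokenize("educative is a new platform.")

# Sets are unordered, so sort them for a stable printout.
print(sorted(words_doc_1))
print(sorted(words_doc_2))
```

Because the trailing period is not matched by the pattern, 'there' and 'platform' come out as clean tokens rather than 'there.' and 'platform.'.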
The intersection, i.e. the set of words common to both documents, is {'educative', 'is', 'platform'}. 3 words are common.
The union, i.e. the set of all distinct words in the documents, is {'educative', 'is', 'the', 'best', 'platform', 'out', 'there', 'a', 'new'}. In total, there are 9 words.
Hence, the Jaccard similarity is 3/9 ≈ 0.333.
```python
def intersection(doc_1, doc_2):
    return doc_1.intersection(doc_2)

def union(doc_1, doc_2):
    return doc_1.union(doc_2)

def jaccard_similarity(doc_1, doc_2):
    # Lowercase and split on whitespace to get each document's words.
    words_doc_1 = doc_1.lower().split()
    words_doc_2 = doc_2.lower().split()

    # Converting to sets drops duplicate words.
    words_doc_1_set = set(words_doc_1)
    words_doc_2_set = set(words_doc_2)

    intersection_docs = intersection(words_doc_1_set, words_doc_2_set)
    union_docs = union(words_doc_1_set, words_doc_2_set)

    # Shared words divided by all distinct words.
    return len(intersection_docs) / len(union_docs)

doc_1 = "educative is the best platform out there"
doc_2 = "educative is a new platform"

print("doc_1 - '%s'" % (doc_1, ))
print("doc_2 - '%s'" % (doc_2, ))
print("Jaccard_similarity(doc_1, doc_2) = %s" % (jaccard_similarity(doc_1, doc_2)))
```
The intersection function returns the intersection of the two sets of words.
The union function returns the union of the two sets of words.
Inside jaccard_similarity, the intersection of doc_1 and doc_2 is found using the intersection function defined in lines 1-2.
The union of doc_1 and doc_2 is found using the union function defined in lines 4-5.
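Any implementation that divides by the size of the union has to decide what happens when both token sets are empty, since the ratio 0/0 is undefined. The sketch below shows one way to handle that case. The function name `safe_jaccard_similarity` and the `empty_value` parameter are hypothetical, and treating two empty documents as identical (1.0) is an assumed convention; some applications may prefer 0.0.

```python
def safe_jaccard_similarity(doc_1, doc_2, empty_value=1.0):
    """Jaccard similarity of two strings, with an explicit empty-input rule."""
    words_1 = set(doc_1.lower().split())
    words_2 = set(doc_2.lower().split())
    union_docs = words_1 | words_2

    # Assumption: two documents without any words count as identical.
    if not union_docs:
        return empty_value

    return len(words_1 & words_2) / len(union_docs)

empty_score = safe_jaccard_similarity("", "")
example_score = safe_jaccard_similarity(
    "educative is the best platform out there",
    "educative is a new platform",
)

print(empty_score)
print(example_score)
```

Passing `empty_value=0.0` switches to the opposite convention without touching the rest of the function.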