Distance Calculations in NLTK

Discover the alternative forms of distance in the NLTK package, such as Jaccard and Jaro-Winkler.

Packages for distance calculations

Distance measures are fundamental to natural language processing, so several packages aim to simplify these calculations. The first is NLTK, which includes distance calculations as part of its broader toolkit. Another popular package is FuzzyWuzzy, a silly-sounding package that specializes in string matching and distance calculations.

NLTK metrics

There are three main metrics we will cover.

Edit distance

To calculate the edit distance between two strings using Python's NLTK package, you can use the edit_distance() function from the nltk.metrics.distance module. The function is largely self-explanatory but accepts a couple of extra parameters.

The edit_distance() function can also take an optional third argument substitution_cost, which specifies the cost of a substitution operation, defaulting to 1.

We can also specify if a transposition counts as an edit (e.g., ba -> ab is 1 edit) by setting transpositions=True. This has some interesting advanced applications that we may explore later in this course.

You can see some sample usage below!

from nltk.metrics.distance import edit_distance
str1 = "kitten"
str2 = "sitting"
distance = edit_distance(str1, str2, transpositions=False)
print("The edit distance is " + str(distance))
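
To make the effect of substitution_cost and transpositions more concrete, here is a minimal pure-Python sketch of the dynamic-programming idea behind edit distance. The function name edit_distance_sketch is hypothetical, and this is an illustration only, not NLTK's own implementation. It handles transpositions in the simplest way, by letting a swap of two adjacent characters count as one edit.

```python
def edit_distance_sketch(s1, s2, substitution_cost=1, transpositions=False):
    """Illustrative dynamic-programming edit distance (not NLTK's own code)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    # dist[i][j] = cost of turning s1[:i] into s2[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i  # delete all i characters
    for j in range(cols):
        dist[0][j] = j  # insert all j characters

    for i in range(1, rows):
        for j in range(1, cols):
            same = s1[i - 1] == s2[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + 1,  # deletion
                dist[i][j - 1] + 1,  # insertion
                dist[i - 1][j - 1] + (0 if same else substitution_cost),
            )
            # Optionally count swapping two adjacent characters as one edit.
            if (transpositions and i > 1 and j > 1
                    and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]):
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[-1][-1]


print(edit_distance_sketch("kitten", "sitting"))                       # 3
print(edit_distance_sketch("kitten", "sitting", substitution_cost=2))  # 5
print(edit_distance_sketch("ba", "ab"))                                # 2
print(edit_distance_sketch("ba", "ab", transpositions=True))           # 1
```

With substitution_cost=2, a substitution costs the same as a deletion followed by an insertion, which is why the kitten/sitting distance rises from 3 to 5.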

Jaccard distance

Jaccard distance is a measure of dissimilarity between two sets of elements. It is defined as one minus the ratio of the size of the sets' intersection to the size of their union, so the higher the number, the less similar the two sets are. Jaccard distance has useful applications, such as measuring how similar two articles are based on their sets of words. It is also used in certain spell checkers in place of edit distance.

Jaccard similarity is a related term, measuring the similarity between two sets, and we find Jaccard distance by taking $1 - \text{Jaccard similarity}$.

Jaccard similarity can be expressed as $\frac{|A \cap B|}{|A \cup B|}$.

Therefore, the Jaccard distance can be calculated by: $J(A,B) = 1 - \frac{|A \cap B|}{|A \cup B|} = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$ ...
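
As a quick check of the formula, the following is a small pure-Python sketch that computes Jaccard similarity and distance with built-in sets. The function names and the two sample sentences are made up for illustration, and treating two empty sets as identical is an assumption of this sketch, not something the formula itself defines.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(union)


def jaccard_distance_sketch(a, b):
    """Jaccard distance is one minus Jaccard similarity."""
    return 1 - jaccard_similarity(a, b)


# Compare two short "articles" by their sets of words.
doc1 = "the cat sat on the mat".split()
doc2 = "the cat lay on the rug".split()

# Shared words: {the, cat, on} (3); all distinct words: 7.
print(round(jaccard_similarity(doc1, doc2), 3))       # 0.429
print(round(jaccard_distance_sketch(doc1, doc2), 3))  # 0.571
```

The two documents share 3 of 7 distinct words, so the similarity is 3/7 and the distance is 4/7.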