Distance Calculations in NLTK
Discover the alternative forms of distance in the NLTK package, such as Jaccard and Jaro-Winkler.
Packages for distance calculations
Distances are fundamental to natural language processing, so a number of packages aim to simplify these calculations. The first is NLTK, which includes distance calculations as part of its broader toolkit. Another popular package is FuzzyWuzzy, a silly-sounding package that specializes in different types of string matching and distance calculations.
NLTK metrics
There are three main metrics we will cover.
Edit distance
To calculate the edit distance between two strings using Python's NLTK package, you can use the edit_distance() function from the nltk.metrics.distance module. The function is fairly self-explanatory but has a couple of extra parameters.
The edit_distance() function can also take an optional third argument, substitution_cost, which specifies the cost of a substitution operation and defaults to 1.
We can also specify whether a transposition counts as a single edit (e.g., ba -> ab is 1 edit) by setting transpositions=True. This has some interesting advanced applications that we may explore later in this course.
You can see some sample usage below!
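from nltk.metrics.distance import edit_distance

# The two strings to compare
str1 = "kitten"
str2 = "sitting"

# Transpositions are not counted as a single edit here (the default)
distance = edit_distance(str1, str2, transpositions=False)
print("The edit distance is " + str(distance))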
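To make the effect of the two optional parameters concrete, here is a minimal sketch that compares the default behavior with a higher substitution_cost and with transpositions=True. The variable names are illustrative only, and the example assumes NLTK is installed.

```python
from nltk.metrics.distance import edit_distance

# Default settings: every insertion, deletion, or substitution costs 1
default_cost = edit_distance("kitten", "sitting")

# Doubling the substitution cost makes each substitution as expensive
# as a deletion followed by an insertion
higher_sub_cost = edit_distance("kitten", "sitting", substitution_cost=2)

# Without transpositions, swapping two adjacent characters takes 2 edits
without_transposition = edit_distance("ba", "ab")

# With transpositions enabled, the same swap counts as a single edit
with_transposition = edit_distance("ba", "ab", transpositions=True)

print("Default:", default_cost)
print("substitution_cost=2:", higher_sub_cost)
print("ba -> ab without transpositions:", without_transposition)
print("ba -> ab with transpositions:", with_transposition)
```

Raising substitution_cost can be useful when a replaced character should count as a more serious difference than a missing or extra one.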
Jaccard distance
Jaccard distance is a measure of dissimilarity between two sets of elements. It is defined as one minus the ratio of the size of the intersection of the sets to the size of their union, so the higher the number, the lower the similarity between the two sets. Jaccard distance has useful applications, such as measuring how similar two articles are based on the sets of words they contain. It is also used in certain spell checkers in place of edit distance.
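Jaccard similarity is a related term that measures the similarity between two sets, and we find Jaccard distance by taking $1 - J(A, B)$, where $J(A, B)$ is the Jaccard similarity of sets $A$ and $B$.
Jaccard similarity can be expressed as $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$.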
Therefore, the Jaccard distance can be calculated by: ...
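The relationship between Jaccard similarity and Jaccard distance can be sketched in plain Python using built-in sets. This is an illustrative sketch, not NLTK's implementation: the helper names, the sample sentences, and the convention that two empty sets have similarity 1.0 are all assumptions made for this example. NLTK offers its own jaccard_distance() function in the nltk.metrics.distance module, which also operates on sets.

```python
def jaccard_similarity(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    union = a | b
    if not union:
        # Assumed convention for this sketch: two empty sets are identical
        return 1.0
    return len(a & b) / len(union)


def jaccard_distance(a: set, b: set) -> float:
    """Dissimilarity: one minus the Jaccard similarity."""
    return 1.0 - jaccard_similarity(a, b)


# Compare two short "articles" by the sets of words they contain
doc1 = set("the cat sat on the mat".split())
doc2 = set("the cat lay on the rug".split())

# The sets share 3 words (the, cat, on) out of 7 distinct words in total
print("Similarity:", jaccard_similarity(doc1, doc2))
print("Distance:", jaccard_distance(doc1, doc2))
```

Because the inputs are sets, repeated words such as "the" are counted only once, so this measure reflects vocabulary overlap rather than word frequency.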