SequenceMatcher is a class in difflib, a module in the Python standard library. The difflib module provides classes and functions for comparing sequences. It can be used to compare files and can produce information about file differences in various formats.
SequenceMatcher compares two input sequences, such as two strings. In other words, it is useful for measuring how similar two strings are at the character level.
The basic idea behind SequenceMatcher is to find the longest contiguous matching subsequence that contains no “junk” elements. The same idea is then applied recursively to the pieces of the sequences to the left and to the right of that match. Junk elements are the things that we don’t want the algorithm to match on, like blank lines in ordinary text files or “<P>” lines in HTML files. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
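To make the effect of junk concrete, here is a minimal sketch modeled on the example in the difflib documentation. It calls find_longest_match() once with no junk function and once with a junk function that treats spaces as junk. The helper name is_space and the sample strings are chosen only for illustration.

```python
from difflib import SequenceMatcher


def is_space(ch):
    # Hypothetical junk predicate: treat a single space as junk.
    return ch == " "


a, b = " abcd", "abcd abcd"

# No junk function: the leading space is allowed to take part in the match.
plain = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))

# Spaces are junk: the match has to be built from "abcd" alone.
no_junk = SequenceMatcher(is_space, a, b).find_longest_match(0, len(a), 0, len(b))

print(plain)    # Match(a=0, b=4, size=5)
print(no_junk)  # Match(a=1, b=0, size=4)
```

Without a junk function, " abcd" matches the " abcd" at the end of the second string. With spaces treated as junk, only "abcd" can match, and it matches the leftmost "abcd" in the second string.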
Below is the code used to compare two strings:
    import difflib

    string1 = "I love to eat apple."
    string2 = "I do not like to eat pineapple."

    temp = difflib.SequenceMatcher(None, string1, string2)

    print(temp.get_matching_blocks())
    print('Similarity Score: ', temp.ratio())
Explanation:
On line 1, we import the difflib module.
On lines 3 and 4, we define the two input strings.
On line 6, we instantiate an object of the SequenceMatcher class. We pass None as the first argument (the isjunk function, so no elements are treated as junk), followed by the two strings.
On line 8, we print the contiguous matching blocks. You can see in the output that we get a list of Match objects, each of which contains:
a: start index of the match in the first string.
b: start index of the match in the second string.
size: length of the match found between the two strings.
On line 9, we print the similarity score of the two input strings. The ratio() function returns the similarity score (a float in [0, 1]) between the input strings. To compute it, ratio() sums the sizes of all matched blocks returned by the get_matching_blocks() function and calculates the ratio as:
Ratio = 2.0 * M / T,
where M = “matches” (the total number of matched elements) and T = “total number of elements” in both sequences.
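As a rough check of the formula, the sketch below recomputes the ratio by hand from get_matching_blocks() and compares it with ratio(). The function name manual_ratio and the sample strings "abcd" and "bcde" are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def manual_ratio(a, b):
    """Recompute 2.0 * M / T from the matching blocks (illustrative helper)."""
    matcher = SequenceMatcher(None, a, b)
    # M: total size of all matching blocks (the trailing dummy block adds 0).
    m = sum(block.size for block in matcher.get_matching_blocks())
    # T: total number of elements in both sequences.
    t = len(a) + len(b)
    # Two empty sequences are treated as identical, as ratio() does.
    return 2.0 * m / t if t else 1.0


a, b = "abcd", "bcde"
print(manual_ratio(a, b))                   # 0.75
print(SequenceMatcher(None, a, b).ratio())  # 0.75
```

Here the only match is "bcd", so M = 3 and T = 4 + 4 = 8, which gives 2.0 * 3 / 8 = 0.75.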
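Now, let’s see how all of this gets calculated. The matching blocks are:
Match(a=0, b=0, size=2), Match(a=2, b=9, size=1), Match(a=5, b=12, size=9), Match(a=14, b=25, size=6), Match(a=20, b=31, size=0)
The last block, Match(a=20, b=31, size=0), is a dummy that marks the end of both strings (their lengths are 20 and 31) and contributes nothing to the sum. So M = 2 + 1 + 9 + 6 = 18 and T = 20 + 31 = 51. The answer is 2.0 * 18 / 51 ≈ 0.706.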
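The calculation can be reproduced with a short sketch that reuses the two strings from the example above. The variable names (matcher, blocks, M, T) are chosen here for illustration. The code slices string1 with each block's a and size to show which text matched, then applies the ratio formula.

```python
from difflib import SequenceMatcher

string1 = "I love to eat apple."
string2 = "I do not like to eat pineapple."

matcher = SequenceMatcher(None, string1, string2)
blocks = matcher.get_matching_blocks()

# Show the text covered by each non-empty block.
for block in blocks:
    if block.size:
        print(block, repr(string1[block.a:block.a + block.size]))

M = sum(block.size for block in blocks)  # matched characters
T = len(string1) + len(string2)          # characters in both strings
print(M, T, round(2.0 * M / T, 4))       # 18 51 0.7059
```

The four non-empty blocks cover "I ", "l", "e to eat ", and "apple.", which together account for the 18 matched characters.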