Overview of String Matching Algorithms

This lesson will provide an overview of matching algorithms.

String Matching

String-matching consists of finding one or all of the occurrences of a string (“pattern”) in a text. The strings are built over a finite set of characters, called “alphabet”.

There are lots of algorithms that solve this problem; here’s a short list from Wikipedia:

Algorithm Preprocessing Matching Space
Naive string-search none O(nm) none
Rabin–Karp O(m) average O(n + m), worst O((n−m)m) O(1)
Knuth–Morris–Pratt O(m) O(n) O(m)
Boyer–Moore O(m + k) best O(n/m), worst O(mn) O(k)
Boyer–Moore-Horspool O(m + k) best O(n/m), worst O(mn) O(k)

m - the length of the pattern n - the length of the text k - the size of the alphabet

Naive algorithm

The naive algorithm tries to match the pattern at each position of the text:

Pattern = Hello
Text = SuperHelloWorld

   SuperHelloWorld
1. Hello <- XX
2.  Hello <- XX
3.   Hello <- XX
4.    Hello <- XX
5.     Hello <- XX
6.      Hello <- OK!    

In the example above we’re looking for “Hello” in “SuperHelloWorld”. As you can see, the naive version tries each position until it finds the “Hello” at the 6th iteration.

The main difference between the naive way and the other algorithms is that the faster algorithms use additional knowledge about the input pattern. That way, they can skip a lot of fruitless comparisons.

To gain that knowledge, they usually build some lookup tables for the pattern in the preprocessing phase. The size of lookup tables is often tied to the size of the pattern and the alphabet.

Boyer Moore algorithm

In the above case we can skip most of the iterations, as we can observe that when we try to match Hello in the first position, there’s a difference at the last letter o vs r. In fact, since r doesn’t occur in our pattern at all, we can actually move 5 steps further.

Pattern = Hello
Text = SuperHelloWorld

   SuperHelloWorld
1. Hello <- XX,
2.      Hello <- OK   

We have a match with just 2 iterations!

The rule that was used in that example comes from the Boyer Moore algorithm - it’s called The Bad Character Rule.

Get hands-on with 1400+ tech skills courses.