What is the Knuth-Morris-Pratt algorithm?

The Knuth-Morris-Pratt (KMP) algorithm is an algorithm that is used to search for a substring (W), in a given string (S), in O(m+n)O(m+n) time (where mm and nn are the lengths of W and S).

Key idea

The key idea used to achieve this time complexity is to minimize the amount of backtracking when a character of W does not match with that of S. This can only be done if we know two things:

  1. Whether or not a proper prefix of W occurs more than once in S ​after at least one character has been correctly found; if it does, it can be skipped when resuming the process of matching after a mismatch.
  2. Length of the proper prefix.
1 character matched
1 of 7

Using an array

Before starting the actual algorithm, a one-dimensional array is initialized with the number of characters that can be skipped after a mismatch. arr[i] represents the number of characters that can be skipped when W[i+1] does not match with a character in S.

In the example above, W[4] (i.e., a) did not match with the character b in S (slide 5), so, the matching was resumed from S[1] instead of S[0] (slide 6). This means that character 1 was skipped; therefore, arr[4-1] = 1.

Failure function

arr is initialized using the failure function f(i)f(i). This function is based on the fact that:

When a mismatch occurs, all the previous characters match correctly; ​this implies that if a prefix of W occurred in this set of matching characters, then that prefix is also a suffix of W.

In other words, arr[i] will represent the length of the longest proper prefix of W, which is also a proper suffix of W (considering W till the ith index only).

The following is the code used to initialize arr:

m = len(W)
i = 0
j = 1

# No proper prefix for string of length 1:
arr[0] = 0

while j < m:
    if W[i] == W[j]:
        i++
        arr[j] = i
        j++
    # The first character didn't match:
    else if i == 0:
        arr[j] = 0
        j++
    # Mismatch after at least one matching character:
    else:
        i = arr[i - 1]
Initialization
1 of 16

Algorithm

The following is the KMP algorithm pseudocode used for searching substring W, in string S, in O(m+n)O(m+n):

# Initialising variables:
i = 0
j = 0
m = len(W)
n = len(S)

# Start search:
while (i < m && j < n){
    # Last character matches -> Substring found:
    if (W[i] == S[j] && i == m - 1):
        return true

    # Character matches:
    else if (W[i] == S[j]):
        i++
        j++
    
    # Character didn't match -> Backtrack:
    else:
        i = arr[i - 1]
        if i == 0:
            j++
}

# Substring not found:
return false

Implementation

The algorithm above (given in the form of a pseudocode) along with the failure function, has been implemented in Python:

# Initialise array based on the failure function:
def init_arr(w):
m = len(w)
i = 0
j = 1
# No proper prefix for string of length 1:
global arr
arr[0] = 0
while j < m:
if w[i] == w[j]:
i += 1
arr[j] = i
j += 1
# The first character didn't match:
elif i == 0:
arr[j] = 0
j += 1
# Mismatch after at least one matching character:
else:
i = arr[i - 1]
def kmp_search(w, s):
# Initialise array:
init_arr(w)
# Initialising variables:
i = 0
j = 0
m = len(w)
n = len(s)
# Start search:
while i < m and j < n:
# Last character matches -> Substring found:
if w[i] == s[j] and i == m - 1:
return True
# Character matches:
elif w[i] == s[j]:
i += 1
j += 1
# Character didn't match -> Backtrack:
else:
if i != 0:
i = arr[i-1]
else:
j += 1
# Substring not found:
return False
text = "hayhello"
substring = "hell"
# create array:
arr = [None] * len(substring)
if kmp_search(substring, text):
print("Substring found!")
else:
print("Could not find substring.")
Copyright ©2024 Educative, Inc. All rights reserved