The Knuth-Morris-Pratt (KMP) algorithm is an algorithm that is used to search for a substring (W
), in a given string (S
), in time (where and are the lengths of W
and S
).
The key idea used to achieve this time complexity is to minimize the amount of backtracking when a character of W
does not match with that of S
. This can only be done if we know two things:
W
occurs more than once in S
after at least one character has been correctly found; if it does, it can be skipped when resuming the process of matching after a mismatch.Before starting the actual algorithm, a one-dimensional array is initialized with the number of characters that can be skipped after a mismatch. arr[i]
represents the number of characters that can be skipped when W[i+1]
does not match with a character in S
.
In the example above, W[4]
(i.e., a
) did not match with the character b
in S
(slide 5), so, the matching was resumed from S[1]
instead of S[0]
(slide 6). This means that character 1 was skipped; therefore, arr[4-1]
= 1.
arr
is initialized using the failure function . This function is based on the fact that:
When a mismatch occurs, all the previous characters match correctly; this implies that if a prefix of W
occurred in this set of matching characters, then that prefix is also a suffix of W
.
In other words, arr[i]
will represent the length of the longest proper prefix of W
, which is also a proper suffix of W
(considering W
till the i
th index only).
The following is the code used to initialize arr
:
m = len(W)
i = 0
j = 1
# No proper prefix for string of length 1:
arr[0] = 0
while j < m:
if W[i] == W[j]:
i++
arr[j] = i
j++
# The first character didn't match:
else if i == 0:
arr[j] = 0
j++
# Mismatch after at least one matching character:
else:
i = arr[i - 1]
The following is the KMP algorithm pseudocode used for searching substring W
, in string S
, in :
# Initialising variables:
i = 0
j = 0
m = len(W)
n = len(S)
# Start search:
while (i < m && j < n){
# Last character matches -> Substring found:
if (W[i] == S[j] && i == m - 1):
return true
# Character matches:
else if (W[i] == S[j]):
i++
j++
# Character didn't match -> Backtrack:
else:
i = arr[i - 1]
if i == 0:
j++
}
# Substring not found:
return false
The algorithm above (given in the form of a pseudocode) along with the failure function, has been implemented in Python:
# Initialise array based on the failure function:def init_arr(w):m = len(w)i = 0j = 1# No proper prefix for string of length 1:global arrarr[0] = 0while j < m:if w[i] == w[j]:i += 1arr[j] = ij += 1# The first character didn't match:elif i == 0:arr[j] = 0j += 1# Mismatch after at least one matching character:else:i = arr[i - 1]def kmp_search(w, s):# Initialise array:init_arr(w)# Initialising variables:i = 0j = 0m = len(w)n = len(s)# Start search:while i < m and j < n:# Last character matches -> Substring found:if w[i] == s[j] and i == m - 1:return True# Character matches:elif w[i] == s[j]:i += 1j += 1# Character didn't match -> Backtrack:else:if i != 0:i = arr[i-1]else:j += 1# Substring not found:return Falsetext = "hayhello"substring = "hell"# create array:arr = [None] * len(substring)if kmp_search(substring, text):print("Substring found!")else:print("Could not find substring.")