What is the Rabin-Karp algorithm?

In bioinformatics, pattern searching is an essential problem. We search for strings in different types of sequences using such algorithms.

The Rabin-Karp algorithm is a string searching or pattern matching algorithm. This algorithm uses the hash function to search the pattern.

Note: To learn more about hashing visit the Answer here.

Example

Let's assume the original string is TT and its length is nn. On the other hand, the string to be searched is PP, and its length is mm.

Spurious hits occur in the searching process. These occur when the hash value of the search pattern is matched with the hash value of the text window, but the window is not the actual pattern.

Let's see the following example to understand the working of the algorithm.

Example of the Rabin-Karp algorithm
1 of 13

Algorithm

The Rabin-Karp algorithm can work with different hash functions. The hash function we use in the above-explained example is to measure the length of the original string (n)(n) and take the mod of the search string by nn. This provides us with the keys to be looked up in each successive window that slides on the original string.

We consider the window size to be the same as the length of the search string (m)(m). Each window has two digits, in this example. We take the mod of the digits in each successive window with nn.

If we get the same result as the one taken at the beginning, i.e., P%nP \% n, we check if it is a spurious hit or an exact hit. This gives us our final index of where the number is found.

Code sample

The code for the Rabin-Karp algorithm is as follows:

number_of_characters = 256
def search(search_string, original_text, q):
n = len(search_string)
m = len(original_text)
i = 0 # index
j = 0 # index
p = 0 # hash key for search_string
t = 0 # hash key for original_text
h = 1
# h would be assigned "pow(number_of_characters, n-1)%q"
for i in range(n-1):
h = (h*number_of_characters)%q
# calculate the hash value of search_string and first occurence of text
for i in range(n):
p = (number_of_characters*p + ord(search_string[i]))%q
t = (number_of_characters*t + ord(original_text[i]))%q
# slide over the search_string character by character
for i in range(m-n+1):
# we check the hash values for the current window of original_text and search_string.
# if the hash value matches then we check for characters one by one
if p==t:
for j in range(n):
if original_text[i+j] != search_string[j]:
break
else: j+=1
# if p == t and search_string[0...n-1] = original_text[i, i+1, ...i+n-1]
if j==n:
print ("search_string found at slide " + str(i))
# we calculate the hash value for the next window of text: Remove
# leading digit, add trailing digit
if i < m-n:
t = (number_of_characters*(t-ord(original_text[i])*h) + ord(original_text[i+n]))%q
# if we encounter negative values of t, we convert it to positive
if t < 0:
t = t+q
# main
original_text = "31415926535"
search_string = "41"
# prime number
q = 101
search(search_string,original_text,q)

Code explanation

In main.py:

  • Line 1: number_of_characters is the number of characters in the input alphabet.

  • Lines 3–42: We use the search function to search the pattern that makes use of the Rabin-Karp algorithm for the given search_string.

This algorithm searches for more than one occurrence of the search_string.

Conclusion

The best and average runtime case for this algorithm is O(n+m)O(n+m). The worst case for this algorithm is O(nm)O(nm). The worst case occurs when the original text TT and the search pattern PP are the same as the hash values of all the substrings. In other words, spurious hits occur several times for all search windows.

Copyright ©2024 Educative, Inc. All rights reserved