String matching is a very common operation in programing, and we all have probably used strings.Index()
in Go, strstr()
in C, or String.indexOf()
in JavaScript.
A naïve string matching implementation would require two nested loops, one iterating over the whole text, and another over the matching patter until the complete pattern is found in the text:
func Index(text, pattern string) int {
outer:
// iterating over whole text
for i := 0; i < len(text)-len(pattern); i++ {
// iterating over pattern
for j := 0; j < len(pattern); j++ {
if text[i+j] != pattern[j] {
continue outer
}
}
return i
}
return -1
}
Such an implementation, however, has terrible performance since it has to perform len(text)*len(pattern)
comparisons in the worst case. This is known as complexity and is often discouraged.
People have been trying to find more optimal solutions for typical use cases of string matching. One such example is bitap, also known as the Shift-Or/Shift-And algorithm.
The idea of the algorithm is to use bitmasks, where bits correspond to the elements (letters) in the pattern. The algorithm has a few constraints:
For modern systems, we can use an alphabet of 256 items (bytes) and restring pattern length of 63 bytes to fit into a uint64 type.
For the given search pattern, we will need a mask table with 256 records for each possible byte value. Each record would be a uint64, where set bit represents the given byte found at the given position in the pattern. For example, if we are looking for “hello” substring (bits are normally counted right-to-left):
olleh
mask['h'] = 0b00001
mask['e'] = 0b00010
mask['l'] = 0b01100
mask['o'] = 0b10000
Matching starts with a zero “state” bitmask (also a uint64) and iterates over each symbol in the given text. The state mask is shifted left by one bit and the lowest bit is set to 1. Then, the state is bitwise multiplied by the mask record from the table we built at the start of the algorithm for the given letter. If the letter matches, the bit is preserved; otherwise, it gets cleared. If, after all these bitwise operations, the state has a 1 bit at the highest position (i.e., for a five-letter pattern, at the fifth bit), it’s a match.
The implementation for this algorithm looks like this:
func Index(text, pattern string) int {
state := uint64(0)
mask := [256]uint64{}
m := len(pattern)
// Pattern is too short or too long
if m == 0 || m > 63 {
return -1
}
// Initialize mask table for each letter in the pattern
for i := 0; i < len(pattern); i++ {
mask[pattern[i]] = mask[pattern[i]] | (1 << uint(i))
}
for i := 0; i < len(text); i++ {
// Update state by shifting it and masking with the record from table
state = state << 1 + 1
state = state & mask[text[i]]
if state&(1<<uint(m-1)) != 0 {
// It's a match!
return i - len(pattern) + 1
}
}
// No match found
return -1
}
In this code, bit-shit operators (
<<
) are used. You can learn more about that here.
package mainfunc Index(text, pattern string) int {state := uint64(0)mask := [256]uint64{}m := len(pattern)// Pattern is too short or too longif m == 0 || m > 63 {return -1}// Initialize mask table for each letter in the patternfor i := 0; i < len(pattern); i++ {mask[pattern[i]] = mask[pattern[i]] | (1 << uint(i))}for i := 0; i < len(text); i++ {// Update state by shifting it and masking with the record from tablestate = state<<1 + 1state = state & mask[text[i]]if state&(1<<uint(m-1)) != 0 {// It's a match!return i - len(pattern) + 1}}// No match foundreturn -1}func main() {pos := Index("helloworld", "low")println("'low' in 'helloworld' found at position", pos)}
The algorithm has better performance than the naive approach for short patterns typical of the natural language. Also, the algorithm can be modified to do fuzzy matching and has been implemented in agrep
utility.
But don’t think that Go standard library is too dumb to use the naïve implementation. In fact, it’s highly optimized – if you look at the sources, you will see many edge cases considered for better performance, including the Rabin–Karp algorithm, which uses the rolling hash. A better understanding of algorithms always pays out by either optimizing the performance of your app or impressing people at your next job interview.