Shift-Or string matching algorithm

String matching is a very common operation in programing, and we all have probably used strings.Index() in Go, strstr() in C, or String.indexOf() in JavaScript.

Naïve approach

A naïve string matching implementation would require two nested loops, one iterating over the whole text, and another over the matching patter until the complete pattern is found in the text:

func Index(text, pattern string) int {
outer:
    // iterating over whole text
	for i := 0; i < len(text)-len(pattern); i++ {
       // iterating over pattern
		for j := 0; j < len(pattern); j++ {
			if text[i+j] != pattern[j] {
				continue outer
			}
		}
		return i
	}
	return -1
}

Such an implementation, however, has terrible performance since it has to perform len(text)*len(pattern) comparisons in the worst case. This is known as O(nm)O(n·m) complexity and is often discouraged.

Optimized approach

People have been trying to find more optimal solutions for typical use cases of string matching. One such example is bitap, also known as the Shift-Or/Shift-And algorithm.

The idea of the algorithm is to use bitmasks, where bits correspond to the elements (letters) in the pattern. The algorithm has a few constraints:

  • Number of letters in the alphabet has to be known.
  • The length of the search pattern should be short enough to fit into one machine wordA machine word is the size (in bits) of data that the machine’s CPU can handle optimally..

For modern systems, we can use an alphabet of 256 items (bytes) and restring pattern length of 63 bytes to fit into a uint64 type.

For the given search pattern, we will need a mask table with 256 records for each possible byte value. Each record would be a uint64, where set bit represents the given byte found at the given position in the pattern. For example, if we are looking for “hello” substring (bits are normally counted right-to-left):

              olleh
mask['h'] = 0b00001
mask['e'] = 0b00010
mask['l'] = 0b01100
mask['o'] = 0b10000

Matching starts with a zero “state” bitmask (also a uint64) and iterates over each symbol in the given text. The state mask is shifted left by one bit and the lowest bit is set to 1. Then, the state is bitwise multiplied by the mask record from the table we built at the start of the algorithm for the given letter. If the letter matches, the bit is preserved; otherwise, it gets cleared. If, after all these bitwise operations, the state has a 1 bit at the highest position (i.e., for a five-letter pattern, at the fifth bit), it’s a match.

Implementation

The implementation for this algorithm looks like this:

func Index(text, pattern string) int {
	state := uint64(0)
	mask := [256]uint64{}
	m := len(pattern)
	// Pattern is too short or too long
	if m == 0 || m > 63 {
		return -1
	}
	// Initialize mask table for each letter in the pattern
	for i := 0; i < len(pattern); i++ {
		mask[pattern[i]] = mask[pattern[i]] | (1 << uint(i))
	}
	for i := 0; i < len(text); i++ {
		// Update state by shifting it and masking with the record from table 
                state = state << 1 + 1
		state = state & mask[text[i]]
		
		if state&(1<<uint(m-1)) != 0 {
			// It's a match!
			return i - len(pattern) + 1
		}
	}
	// No match found
	return -1
}

In this code, bit-shit operators ( << ) are used. You can learn more about that here.

package main
func Index(text, pattern string) int {
state := uint64(0)
mask := [256]uint64{}
m := len(pattern)
// Pattern is too short or too long
if m == 0 || m > 63 {
return -1
}
// Initialize mask table for each letter in the pattern
for i := 0; i < len(pattern); i++ {
mask[pattern[i]] = mask[pattern[i]] | (1 << uint(i))
}
for i := 0; i < len(text); i++ {
// Update state by shifting it and masking with the record from table
state = state<<1 + 1
state = state & mask[text[i]]
if state&(1<<uint(m-1)) != 0 {
// It's a match!
return i - len(pattern) + 1
}
}
// No match found
return -1
}
func main() {
pos := Index("helloworld", "low")
println("'low' in 'helloworld' found at position", pos)
}

Time complexity

The algorithm has better performance than the naive approach for short patterns typical of the natural language. Also, the algorithm can be modified to do fuzzy matching and has been implemented in agrep utility.

But don’t think that Go standard library is too dumb to use the naïve implementation. In fact, it’s highly optimized – if you look at the sources, you will see many edge cases considered for better performance, including the Rabin–Karp algorithm, which uses the rolling hash. A better understanding of algorithms always pays out by either optimizing the performance of your app or impressing people at your next job interview.