Solution: Repeated DNA Sequences
Let's solve the Repeated DNA Sequences problem using the Sliding Window pattern.
Statement
Given a string, dna
, that represents a DNA subsequence, and a number , return all the contiguous subsequences (substrings) of length that occur more than once in the string. The order of the returned subsequences does not matter. If no repeated substring is found, the function should return an empty set.
The DNA sequence is composed of a series of nucleotides abbreviated as , , , and . For example, is a DNA sequence. When studying DNA, it is useful to identify repeated subsequences in it.
Constraints:
-
dna.length
dna[i]
is either A, C, G, or T.
Solution
So far, you’ve probably brainstormed some approaches on how to solve this problem. Let’s explore some of these approaches and figure out which one to follow while considering time complexity and any implementation constraints.
Naive approach
A naive approach would be to iterate through the input DNA sequence and add all the unique substrings of length to a set. If a substring is already present in a set, it is a repeated substring.
Here’s how the algorithm works:
-
We iterate the string using a pointer , ranging from to . This is the number of -length substrings present in the sequence.
-
At each iteration, we generate the current -length substring, i.e., dna[i]…dna[i + k - 1].
-
Next, we check if this substring is already present in the set.
-
If it is, the current substring is a repeated sequence, so we add it to our output.
-
Otherwise, the current substring has not yet been repeated, so we just add it to the set.
-
-
We repeat the above process for all -length substrings.
-
Once all -length substrings have been evaluated, we return the output.
The time complexity of this approach is , where is the length of the input sequence and is the size of each contiguous subsequence we consider. This is because we iterate over substrings of length , and at each iteration, the time taken to generate a -length substring is .
The space complexity of this approach is , since in the worst case, our set can contain elements, and at each iteration of the traversal, we are allocating memory to generate a new -length substring.
Optimized approach using sliding window
The problem can be optimized using a sliding window approach. We use the Rabin-Karp algorithm that utilizes a sliding window with
Here’s the basic idea of the algorithm:
-
We traverse the string by using a sliding window of length , which slides one character forward on each iteration.
-
On each iteration, we compute the hash of the current -length substring in the window.
-
We check if the hash is already present in the set.
-
If it is, the substring is repeated, so we add it to the output.
-
Otherwise, the substring has not yet been repeated, so we add the computed hash to the set.
-
-
We repeat the above process for all -length substrings by sliding the window one character forward on each iteration.
-
After all -length substrings have been evaluated, we return the output.
There are multiple approaches for computing hash values, and the choice of the hash function can impact the algorithm’s time complexity. Let’s look at some approaches below.
Hashing and comparison in linear time
Let’s use a simple hashing method that sums the ASCII code of characters present in a window.
Consider the sequence with .
-
Initially, the sequence in the window is and its hash value is:
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of ...