Solution: Repeated DNA Sequences
Let's solve the Repeated DNA Sequences problem using the Sliding Window pattern.
Statement
Given a string, dna
, that represents a DNA subsequence, and a number , return all the contiguous subsequences (substrings) of length that occur more than once in the string. The order of the returned subsequences does not matter. If no repeated substring is found, the function should return an empty set.
The DNA sequence is composed of a series of nucleotides abbreviated as , , , and . For example, is a DNA sequence. When studying DNA, it is useful to identify repeated subsequences in it.
Constraints:
-
dna.length
dna[i]
is either A, C, G, or T.
Solution
So far, you’ve probably brainstormed some approaches on how to solve this problem. Let’s explore some of these approaches and figure out which one to follow while considering time complexity and any implementation constraints.
Naive approach
A naive approach would be to iterate through the input DNA sequence and add all the unique substrings of length to a set. If a substring is already present in a set, it is a repeated substring.
Here’s how the algorithm works:
-
We iterate the string using a pointer , ranging from to . This is the number of -length substrings present in the sequence.
-
At each iteration, we generate the current -length substring, i.e., dna[i]…dna[i + k - 1].
-
Next, we check if this substring is already present in the set.
-
If it is, the current substring is a repeated sequence, so we add it to our output.
-
Otherwise, the current substring has not yet been repeated, so we just add it to the set.
-
-
We repeat the above process for all -length substrings.
-
Once all -length substrings have been evaluated, we return the output.
The time complexity of this approach is , where is the length of the input sequence and is the size of each contiguous subsequence we consider. This is because we iterate over substrings of length , and at each iteration, the time taken to generate a -length substring is .
The space complexity of this approach is , since in the worst case, our set can contain elements, and at each iteration of the traversal, we are allocating memory to generate a new -length substring.
Optimized approach using sliding window
The problem can be optimized using a sliding window approach. We use the Rabin-Karp algorithm that utilizes a sliding window with
Here’s the basic idea of the algorithm:
-
We traverse the string by using a sliding window of length , which slides one character forward on each iteration.
-
On each iteration, we compute the hash of the current -length substring in the window.
-
We check if the hash is already present in the set.
-
If it is, the substring is repeated, so we add it to the output.
-
Otherwise, the substring has not yet been repeated, so we add the computed hash to the set.
-
-
We repeat the above process for all -length substrings by sliding the window one character forward on each iteration.
-
After all -length substrings have been evaluated, we return the output.
There are multiple approaches for computing hash values, and the choice of the hash function can impact the algorithm’s time complexity. Let’s look at some approaches below.
Hashing and comparison in linear time
Let’s use a simple hashing method that sums the ASCII code of characters present in a window.
Consider the sequence with .
-
Initially, the sequence in the window is and its hash value is:
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of will be removed from the previous hash value and the ASCII of will then be added:
Since the above hash value has not been repeated yet, we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of will be removed from the previous hash value and then again added:
Here, we have the same hash value but different sequences— and . This means that if a hash value is already present in the set, we need to compare the corresponding sequences as well to confirm if they are identical. In this case, they are not, so we add this hash value to the set and slide the window one character forward.
-
The sequence in the window is now . To compute the hash value of , the ASCII of will be removed from the previous hash value and then again added:
Here we have the same hash value, so we compare the two sequences. Since they are identical, the sequence has been repeated and is therefore added to the output.
Computing the hash value and then comparing the strings if the hashes are equal will take linear time, . In the worst case, the comparisons will occur after each slide, which will make the running time the same as that of the naive approach, which is .
Hashing and comparison in constant time
We need a hash function that helps us achieve constant-time hashing. For this purpose, we use the polynomial rolling hash technique:
Here, is a constant, are the characters in a sequence, and is the substring length. Since we only have possible nucleotides, our would be . We also assign numeric values to the nucleotides, as shown in the table below:
Level up your interview prep. Join Educative to access 70+ hands-on prep courses.