Feature #8: Similarity Measure Between DNA Samples
Implement the "Similarity Measure Between DNA Samples" feature for our "Computational Biology" project.
Description
The DNA of an alien species consists of a sequence of nucleotides, where each nucleotide is represented by a letter. We received two such DNA samples, and we need to measure the extent of similarity—also known as the edit distance—between them. The prevalent measure of similarity between these two samples is the minimum number of edits that are required to convert one DNA sample to the other.
Note: We are only permitted to insert, delete, or update a nucleotide in a DNA sample.
Given two DNA samples as strings, sample1
and sample2
, we have to return the minimum number of operations that are required to convert sample1
to sample2
.
The following examples may clarify this problem:
Solution
We’ll compare the sample1
string with the sample2
string, one character at a time. If the characters at the current position match, then no edit operation will be required.
On the other hand, if the characters at the current position in the two strings don’t match, we can perform one of the following three operations, whichever will result in the fewest edit operations:
- Insert a character to
sample1
at the current position. - Delete the character in
sample1
at the current position. - Replace the character in
sample1
at the current position with the character at the current position insample2
.
The choice shown above can’t be made on the basis of local optimality. The actual cost of all of the choices given above must consider the edit cost for matching the rest of the characters in sample1
with the rest of the characters in sample2
.
Let’s look at an example where we compare sample1 = abcdef
with sample2 = azced
. If we start the comparison on the first characters ...
Level up your interview prep. Join Educative to access 70+ hands-on prep courses.