Some Hidden Messages are More Elusive than Others

Get a brief introduction to Hamming distance, approximate pattern matching, frequent words with mismatches, and reverse complements.

The Minimum Skew Problem from the previous lesson gives us an approximate location of ori in E. coli: position 3923620.

In an attempt to confirm this hypothesis, let’s look for a hidden message representing a potential DnaA box near this location. Solving the Frequent Words Problem in a window of length 500 starting at position 3923620 reveals no 9-mers (along with their reverse complements) that appear three or more times! Even if we’ve located ori in E. coli, it appears that we still haven’t found the DnaA boxes that jump-start replication in this bacterium.
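
The search described above can be sketched in Python. This is a minimal illustration, not the lesson's own implementation: the function names, the `genome` string, and the 0-based `start` index are assumptions made for the example, and the position 3923620 may need adjusting depending on the indexing convention of your genome file.

```python
from collections import Counter

# Translation table mapping each nucleotide to its complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(pattern: str) -> str:
    """Return the reverse complement of a DNA string."""
    return pattern.translate(_COMPLEMENT)[::-1]


def counts_with_reverse_complements(text: str, k: int) -> dict:
    """Map each k-mer in text to its count plus its reverse complement's count."""
    counts = Counter(text[i:i + k] for i in range(len(text) - k + 1))
    combined = {}
    for kmer, n in counts.items():
        rc = reverse_complement(kmer)
        # A k-mer equal to its own reverse complement is counted only once.
        combined[kmer] = n if rc == kmer else n + counts[rc]
    return combined


def frequent_in_window(genome, start, k=9, window=500, min_count=3):
    """Return k-mers reaching min_count in genome[start:start + window]."""
    combined = counts_with_reverse_complements(genome[start:start + window], k)
    return sorted(kmer for kmer, n in combined.items() if n >= min_count)
```

With the E. coli genome loaded into a string, a call such as `frequent_in_window(genome, 3923620)` would be expected to return an empty list, matching the observation above that no 9-mer reaches three occurrences.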

STOP and Think: What would you do next?

Approximate occurrences of k-mers

Before we give up, let’s examine the ori of Vibrio cholerae one more time to see if it provides any insight into how to alter our algorithm to find DnaA boxes in E. coli. You may have noticed that in addition to the three occurrences of ATGATCAAG and three occurrences of its reverse complement CTTGATCAT, the Vibrio cholerae ori contains additional occurrences of ATGATCAAC and CATGATCAT, which differ from ATGATCAAG and CTTGATCAT, respectively, in only a single nucleotide.
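
To see exactly where these 9-mers differ, one can compare them position by position. The helper below is a small illustrative sketch; the name `mismatch_positions` is hypothetical and positions are 0-based.

```python
def mismatch_positions(p: str, q: str) -> list:
    """Return the 0-based positions at which equal-length strings p and q differ."""
    if len(p) != len(q):
        raise ValueError("strings must have the same length")
    return [i for i, (a, b) in enumerate(zip(p, q)) if a != b]


print(mismatch_positions("ATGATCAAG", "ATGATCAAC"))  # [8]
print(mismatch_positions("CTTGATCAT", "CATGATCAT"))  # [1]
```

Each pair differs at exactly one position: the last nucleotide in the first pair and the second nucleotide in the second pair.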

Finding eight approximate occurrences of our target 9-mer and its reverse complement in a short region is even more statistically surprising than finding the six exact occurrences of ATGATCAAG and its reverse complement CTTGATCAT that we stumbled upon at the beginning of our investigation. Furthermore, the discovery of these approximate 9-mers makes sense biologically, since DnaA can bind not only to “perfect” DnaA boxes but to their slight variations as well.
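
Counting approximate occurrences like these amounts to sliding a window across the text and accepting every window that differs from the target in at most d positions. The following is a rough sketch under that reading; `hamming_distance` and `approximate_pattern_count` are names chosen for this example, and the mismatch threshold `d` is a parameter the caller picks.

```python
def hamming_distance(p: str, q: str) -> int:
    """Count the positions at which equal-length strings p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def approximate_pattern_count(text: str, pattern: str, d: int) -> int:
    """Count windows of text that differ from pattern in at most d positions."""
    k = len(pattern)
    return sum(
        hamming_distance(text[i:i + k], pattern) <= d
        for i in range(len(text) - k + 1)
    )


# Toy example: one exact copy and one single-mismatch copy of the target.
toy = "ATGATCAAG" + "TT" + "ATGATCAAC"
print(approximate_pattern_count(toy, "ATGATCAAG", 0))  # 1
print(approximate_pattern_count(toy, "ATGATCAAG", 1))  # 2
```

Running the same count for a pattern and for its reverse complement, then adding the two results, gives the combined tally of approximate occurrences on both strands.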

We say that position $i$ in $k$-mers $p_1 \cdots p_k$ and $q_1 \cdots q_k$ ...