...

/

Open Problem: Computing Probabilities of Patterns in a String

Open Problem: Computing Probabilities of Patterns in a String

Learn how we can compute probabilities of patterns in a string by observing different probability patterns.

Overlapping words paradox

In the main text, we told you that the probability that a random DNA string of length 500 contains a 9-mer appearing three or more times is approximately 1/1300. In DETOUR: Probabilities of Patterns in a String, we describe a method to estimate this probability, but it’s rather inaccurate. This open problem is aimed at finding better approximations or even deriving exact formulas for probabilities of patterns in strings.

We start by asking a question: what’s the probability that a specific k-mer Pattern will appear (at least once) as a substring of a random string of length N? This question proved to be not so simple and was first addressed by Solov’ev, 1996 (see also the work by Sedgewick and Flajolet, 2013).

The first surprise is that different k-mers may have different probabilities of appearing in a random string. For example, the probability that Pattern = “01” appears in a random binary string of length 4 is 11/16, while the probability that Pattern = “11” appears in a random binary string of length 4 is 8/16. This phenomenon is called the overlapping words paradox because ...