...

/

Detour: Probabilities of Patterns in a String

Detour: Probabilities of Patterns in a String

Explore how probabilities of patterns in a string are calculated, how changing the pattern changes the probability, and what the overlapping words paradox is.

We mentioned that the probability that some 9-mer appears 3 or more times in a random DNA string of length 500 is approximately 1/1300. We assure you that this calculation doesn’t appear out of thin air. Specifically, we can generate a random string modeling a DNA strand by choosing each nucleotide for any position with a probability 1/4. The construction of random strings can be generalized to an arbitrary alphabet with A symbols, where each symbol is chosen with probability 1/A.

Exercise Break: What is the probability that two randomly generated strings of length n in an A-letter alphabet are identical?

Now, there’s a simple question: what’s the probability that a specific k-mer Pattern will appear (at least once) as a substring of a random string of length N? For example, say that we want to find the probability that “01” appears in a random binary string (A = 2) of length 4. Here are all possible such strings:

Probability That “01” Appears in a Random Binary String









0000

0001

0010

0011

0100

0101

0110

0111

1000

1001

1001

1011

1100

1101

1110

1111

Because “01” is a substring of 11 of these 4-mers, and because each 4-mer could be generated with probability 1/16, the probability that “01” appears in a random binary 4-mer is 11/16.

STOP and Think: What’s the probability that Pattern = “11” appears as a substring of a random binary 4-mer?

Changing pattern changes the probability

Surprisingly, changing Pattern from “01” to “11” changes the probability that it appears as a substring of a random binary string. Indeed, “11” appears in only 8 binary 4-mers:

Changing Pattern from “01” to “11”









0000

0001

0010

0011

0100

0101

0110

0111

1000

1001

1001

1011

1100

1101

1110

1111

As a result, the probability of “11” appearing in a random binary string of length 4 is 8/16 = 1/2.

STOP and Think: Why do you think that “11” is less likely than “01” to appear as a substring of a random binary 4-mer?

Computing probability

Let Pr(N, A, Pattern, t) denote the probability that a string Pattern appears t or more times in a random string of length N formed from an alphabet of A letters. We saw that Pr(4, 2, “01,” 1) = 11/16 while Pr(4, 2, “11,” 1) = 1/2. Interestingly, when we make t greater than 1, we see that “01” is less likely to appear multiple times than “11.” For example, the probability of finding “01” twice or more in a random binary 4-mer is given by Pr(4, 2, “01,” 2) = 1/16 because “0101” is the only binary 4-mer containing “01” twice, and yet Pr(4, 2, “11,” 2) = 3/16 because binary 4-mers “0111,” “1110,” and “1111” all have at least two occurrences of “11.”

Exercise Break: Compute Pr(100, 2, “01,” 1).

Overlapping words paradox

We’ve seen that different k-mers have different probabilities of occurring multiple times as a substring of a random string. In general, this phenomenon is called the overlapping words paradox because different substring occurrences of Pattern can overlap each other for some choices of Pattern but not others (see Detour: The Overlapping Words Paradox).

For example, there are two overlapping occurrences of “11” in “1110” and three overlapping occurrences of “11” in “1111”; yet occurrences of “01” can never overlap with each other, and so “01” can never occur more than twice in a binary 4-mer. The overlapping words paradox makes computing Pr(N, A, Pattern, t) a rather complex problem because this probability depends heavily on the particular choice of Pattern. In light of the complications presented by the overlapping words paradox, we’ll try to approximate Pr(A, N, Pattern, t) rather than compute it exactly.

Approximating strings

To approximate Pr(N, A, Pattern, t), we’ll assume that the k-mer Pattern is not overlapping. As a toy example, say we wish to count the number of ternary strings (A = 3) of length 7 that contain “01” at least twice. Apart from the two occurrences of “01,” we have three remaining symbols in the string. Let’s assume that these symbols are all “2.” The two occurrences of “01” can be inserted into “222” in ten different ways to form a 7-mer, as shown below.

The Two Occurrences of “01” Can Be Inserted into “222” in Ten Different Ways to Form a 7-mer






0101222

0120122

0122012

0122201

2010122

2012012

2012201

2201012

2201201

2220101

We inserted these two occurrences of “01” into “222,” but we could’ve inserted them into any other ternary 3-mer. Because there are 33^{3} = 27 ternary 3-mers, we obtain an approximation of 10 · 27 = 270 for the number of ternary 7-mers that contain two or more instances of “01.” Because there are 37^{7} = 2187 ternary 7-mers, we estimate the probability Pr(7, 3, “01,” 2) as 270/2187.

STOP and Think: Is 270/2187 a good approximation for Pr(7, 3, “01,” 2)? Is the true probability Pr(7, 3, “01,” 2) larger or smaller than 270/2187?

To generalize the above method to approximate Pr(N, A, Pattern, t) for arbitrary parameter values, consider a string Text of length N having at least t occurrences of a k-mer Pattern. If we select exactly t of these occurrences, then we can think about Text as a sequence of n=Ntkn = N − t · k symbols interrupted by t insertions of the k-mer Pattern. If we fix these n symbols, then we wish to count the number of different strings Text that can be formed by inserting t occurrences of Pattern into a string formed by these n symbols.

For example, consider again the problem of embedding two occurrences of “01” into “222” (n = 3), and note that we’ve added five copies of a capital “X” below each 7-mer.

Embedding Two Occurrences of “01” into “222”






0101222

0120122

0122012

0122201

2010122

XX XXX

X XX XX

X XXX X

X XXXX

XX X XX

2010112

2012201

2201012

2201201

2220101

XX XX X

XX XXX

XXX X X

XXX XX

XXXX X

What do the “X” mean? Instead of counting the number of ways to insert two occurrences of “01” into “222,” we can count the number of ways to select two of the five “X” to color blue.

What Does the X Mean?







XXXXX

XXXXX

XXXXX

XXXXX

XXXXX

0111

XXXXX

XXXXX

XXXXX

XXXXX

XXXXX

1111

In other words, we’re counting the number of ways to choose 2 out of 5 objects, which can be counted by the binomial coefficient (52)\binom{5}{2} = 10. More generally, the binomial coefficient (mk)\binom{m}{k} ...