Detour: Probabilities of Patterns in a String
Explore how probabilities of patterns in a string are calculated, how changing the pattern changes the probability, and what the overlapping words paradox is.
We mentioned that the probability that some 9-mer appears 3 or more times in a random DNA string of length 500 is approximately 1/1300. We assure you that this calculation doesn’t appear out of thin air. Specifically, we can generate a random string modeling a DNA strand by choosing each nucleotide for any position with a probability 1/4. The construction of random strings can be generalized to an arbitrary alphabet with A symbols, where each symbol is chosen with probability 1/A.
Exercise Break: What is the probability that two randomly generated strings of length n in an A-letter alphabet are identical?
Now, there’s a simple question: what’s the probability that a specific k-mer Pattern will appear (at least once) as a substring of a random string of length N? For example, say that we want to find the probability that “01” appears in a random binary string (A = 2) of length 4. Here are all possible such strings:
Probability That “01” Appears in a Random Binary String
0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 |
1000 | 1001 | 1001 | 1011 | 1100 | 1101 | 1110 | 1111 |
Because “01” is a substring of 11 of these 4-mers, and because each 4-mer could be generated with probability 1/16, the probability that “01” appears in a random binary 4-mer is 11/16.
STOP and Think: What’s the probability that Pattern = “11” appears as a substring of a random binary 4-mer?
Changing pattern changes the probability
Surprisingly, changing Pattern from “01” to “11” changes the probability that it appears as a substring of a random binary string. Indeed, “11” appears in only 8 binary 4-mers:
Changing Pattern from “01” to “11”
0000 | 0001 | 0010 | 0011 | 0100 | 0101 | 0110 | 0111 |
1000 | 1001 | 1001 | 1011 | 1100 | 1101 | 1110 | 1111 |
As a result, the probability of “11” appearing in a random binary string of length 4 is 8/16 = 1/2.
STOP and Think: Why do you think that “11” is less likely than “01” to appear as a substring of a random binary 4-mer?
Computing probability
Let Pr(N, A, Pattern, t) denote the probability that a string Pattern appears t or more times in a random string of length N formed from an alphabet of A letters. We saw that Pr(4, 2, “01,” 1) = 11/16 while Pr(4, 2, “11,” 1) = 1/2. Interestingly, when we make t greater than 1, we see that “01” is less likely to appear multiple times than “11.” For example, the probability of finding “01” twice or more in a random binary 4-mer is given by Pr(4, 2, “01,” 2) = 1/16 because “0101” is the only binary 4-mer containing “01” twice, and yet Pr(4, 2, “11,” 2) = 3/16 because binary 4-mers “0111,” “1110,” and “1111” all have at least two occurrences of “11.”
Exercise Break:
Compute Pr(100, 2, “01,” 1).
Overlapping words paradox
We’ve seen that different k-mers have different probabilities of occurring multiple times as a substring of a random string. In general, this phenomenon is called the overlapping words paradox because different substring occurrences of Pattern can overlap each other for some choices of Pattern but not others (see Detour: The Overlapping Words Paradox).
For example, there are two overlapping occurrences of “11” in “1110” and three overlapping occurrences of “11” in “1111”; yet occurrences of “01” can never overlap with each other, and so “01” can never occur more than twice in a binary 4-mer. The overlapping words paradox makes computing Pr(N, A, Pattern, t) a rather complex problem because this probability depends heavily on the particular choice of Pattern. In light of the complications presented by the overlapping words paradox, we’ll try to approximate Pr(A, N, Pattern, t) rather than compute it exactly.
Approximating strings
To approximate Pr(N, A, Pattern, t), we’ll assume that the k-mer Pattern is not overlapping. As a toy example, say we wish to count the number of ternary strings (A = 3) of length 7 that contain “01” at least twice. Apart from the two occurrences of “01,” we have three remaining symbols in the string. Let’s assume that these symbols are all “2.” The two occurrences of “01” can be inserted into “222” in ten different ways to form a 7-mer, as shown below.
The Two Occurrences of “01” Can Be Inserted into “222” in Ten Different Ways to Form a 7-mer
0101222 | 0120122 | 0122012 | 0122201 | 2010122 |
2012012 | 2012201 | 2201012 | 2201201 | 2220101 |
We inserted these two occurrences of “01” into “222,” but we could’ve inserted them into any other ternary 3-mer. Because there are 3 = 27 ternary 3-mers, we obtain an approximation of 10 · 27 = 270 for the number of ternary 7-mers that contain two or more instances of “01.” Because there are 3 = 2187 ternary 7-mers, we estimate the probability Pr(7, 3, “01,” 2) as 270/2187.
STOP and Think: Is 270/2187 a good approximation for Pr(7, 3, “01,” 2)? Is the true probability Pr(7, 3, “01,” 2) larger or smaller than 270/2187?
To generalize the above method to approximate Pr(N, A, Pattern, t) for arbitrary parameter values, consider a string Text of length N having at least t occurrences of a k-mer Pattern. If we select exactly t of these occurrences, then we can think about Text as a sequence of symbols interrupted by t insertions of the k-mer Pattern. If we fix these n symbols, then we wish to count the number of different strings Text that can be formed by inserting t occurrences of Pattern into a string formed by these n symbols.
For example, consider again the problem of embedding two occurrences of “01” into “222” (n = 3), and note that we’ve added five copies of a capital “X” below each 7-mer.
Embedding Two Occurrences of “01” into “222”
0101222 | 0120122 | 0122012 | 0122201 | 2010122 |
XX XXX | X XX XX | X XXX X | X XXXX | XX X XX |
2010112 | 2012201 | 2201012 | 2201201 | 2220101 |
XX XX X | XX XXX | XXX X X | XXX XX | XXXX X |
What do the “X” mean? Instead of counting the number of ways to insert two occurrences of “01” into “222,” we can count the number of ways to select two of the five “X” to color blue.
What Does the X Mean?
XXXXX | XXXXX | XXXXX | XXXXX | XXXXX | 0111 |
XXXXX | XXXXX | XXXXX | XXXXX | XXXXX | 1111 |
In other words, we’re counting the number of ways to choose 2 out of 5 objects, which can be counted by the binomial coefficient = 10. More generally, the binomial coefficient ...