Charging Station: Solving the Clump Finding Problem

Let’s have a look at the solutions to the clump finding problem.

We'll cover the following

This lesson assumes that you’ve read Charging Station: The Frequency Array.

The pseudocode below slides a window of length L down Genome. After computing the frequency array for the current window, it identifies (L, t)-clumps simply by finding which k-mers occur at least t times within the window. To keep track of these clumps, our algorithm uses an array Clump of length 4k^{k} whose values are all initialized to zero. For each value of i between 0 and 4k^{k} − 1, we’ll set Clump(i) equal to 1 if NumberToPattern(i, k) forms an (L, t)-clump in Genome.

 ClumpFinding(Genome, k, L, t) 
   FrequentPatterns ← an empty set 
   for i ← 0 to 4^k − 1
       CLUMP(i) ← 0
   for i ← 0 to |Genome|−L
       Text ← the string of length L starting at position i in Genome 
       FrequencyArray ← ComputingFrequencies(Text, k)
       for index ← 0 to 4^k − 1
            if FREQUENCYARRAY(index) ≥ t 
                CLUMP(index) ← 1
   for i ← 0 to 4^k − 1
       if CLUMP(i) = 1 
           Pattern ← NUMBERTOPATTERN(i, k)  
           add Pattern to the set FrequentPatterns
   return FrequentPatterns

Get hands-on with 1400+ tech skills courses.