Breaking reads into k-mers

Our discussion of genome assembly has thus far relied on several simplifying assumptions, so applying de Bruijn graphs to real sequencing data is not straightforward. Below, we describe practical challenges introduced by quirks in modern sequencing technologies and some computational techniques that have been devised to address them. For simplicity, we’ll first assume that reads are generated as contiguous substrings of a genome rather than as read-pairs.

Illumina sequencing technology

Given a k-mer substring of a genome, we define its coverage as the number of reads to which this k-mer belongs. We’ve taken for granted that a sequencing machine can generate all k-mers present in the genome, but this assumption of perfect k-mer coverage doesn’t hold in practice. For example, the popular Illumina sequencing technology generates reads that are approximately 300 nucleotides long, but this technology still misses many 300-mers present in the genome (even if the average coverage is very high), and nearly all the reads that it does generate have sequencing errors.
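To make the definition of coverage concrete, here is a minimal Python sketch that counts, for every k-mer in a genome, how many reads contain it. The function names (`kmers`, `kmer_coverage`, `has_perfect_coverage`) and the toy genome and reads are our own assumptions for illustration. In practice the genome is unknown, so this kind of check is only possible on a toy example.

```python
def kmers(text, k):
    """Return all k-mers of text, in order of position."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def kmer_coverage(genome, reads, k):
    """Map each distinct k-mer of genome to the number of reads containing it."""
    return {
        kmer: sum(kmer in read for read in reads)
        for kmer in set(kmers(genome, k))
    }


def has_perfect_coverage(genome, reads, k):
    """True if every k-mer of genome occurs in at least one read."""
    return all(count > 0 for count in kmer_coverage(genome, reads, k).values())


# Hypothetical toy data: a 20-nucleotide genome and four error-free 10-mer reads.
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

coverage = kmer_coverage(genome, reads, 10)
missed = sorted(kmer for kmer, count in coverage.items() if count == 0)
print("10-mers missed by the reads:", missed)
print("Perfect 10-mer coverage?", has_perfect_coverage(genome, reads, 10))
```

Note that this sketch assumes error-free reads. With real reads, a k-mer containing a sequencing error would simply fail to match the genome.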

STOP and Think: Given a set of reads having imperfect k-mer coverage, can you find a parameter l < k so that the same reads have perfect l-mer coverage? What is the maximum value of this parameter?

The figure below (left) shows four 10-mer reads that capture some but not all of the 10-mers in an example genome. However, if we take the counterintuitive step of breaking these reads into shorter 5-mers (Figure 3.37, right), then these 5-mers exhibit perfect coverage. This read breaking approach, in which we break reads into shorter k-mers, is used by many modern assemblers.
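The read breaking idea can be sketched in a few lines of Python. The example below uses a hypothetical 20-nucleotide genome and four 10-mer reads chosen for illustration (not necessarily those in the figure), and the helper names `break_reads`, `genome_kmers`, and `max_perfect_k` are our own. Because it compares against a known genome, `max_perfect_k` is only meaningful in a toy setting like this one, since a real assembler does not have the genome to check against.

```python
def break_reads(reads, k):
    """Break every read into its k-mers and return them as a set."""
    return {
        read[i:i + k]
        for read in reads
        for i in range(len(read) - k + 1)
    }


def genome_kmers(genome, k):
    """Return the set of all k-mers occurring in genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def max_perfect_k(genome, reads):
    """Largest k for which breaking the reads yields every k-mer of genome.

    Returns 0 if no such k exists. Assumes reads is non-empty.
    """
    shortest_read = min(len(read) for read in reads)
    for k in range(shortest_read, 0, -1):
        # Perfect coverage: the genome's k-mers are a subset of the reads' k-mers.
        if genome_kmers(genome, k) <= break_reads(reads, k):
            return k
    return 0


# Hypothetical toy data: error-free 10-mer reads sampled from a short genome.
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

for k in (10, 5):
    found = len(genome_kmers(genome, k) & break_reads(reads, k))
    total = len(genome_kmers(genome, k))
    print(f"k = {k}: reads supply {found} of {total} genome k-mers")

print("Largest k with perfect coverage:", max_perfect_k(genome, reads))
```

Searching from the read length downward answers the STOP and Think question for this toy data set by brute force, which is reasonable here because the reads are short.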
