Epilogue: Genome Assembly Faces Real Sequencing Data

Learn about practical challenges introduced by modern sequencing technologies and some techniques to address them.

We'll cover the following...

Breaking reads into k-mers
- Illumina sequencing technology
Splitting the genome into contigs
- Solution explanation
- Non-branching
Charging Station: Maximal Non-Branching Paths in a Graph
Assembling error-prone reads
- Bubble removal
Inferring multiplicities of edges in de Bruijn graphs
- Practical considerations

Breaking reads into k-mers

Our discussion of genome assembly has thus far relied upon various assumptions that don't hold for real data. Accordingly, applying de Bruijn graphs to real sequencing data is not a straightforward procedure. Below, we describe practical challenges introduced by quirks in modern sequencing technologies and some computational techniques that have been devised to address these challenges. For simplicity, we'll first assume that reads are generated as contiguous substrings of a genome rather than as read-pairs.

Illumina sequencing technology

Given a k-mer substring of a genome, we define its coverage as the number of reads to which this k-mer belongs. We’ve taken for granted that a sequencing machine can generate all k-mers present in the genome, but this assumption of perfect k-mer coverage doesn’t hold in practice. For example, the popular Illumina sequencing technology generates reads that are approximately 300 nucleotides long, but this technology still misses many 300-mers present in the genome (even if the average coverage is very high), and nearly all the reads that it does generate have sequencing errors.
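To make the definition of coverage concrete, here is a minimal Python sketch that counts, for each k-mer of a genome, how many reads contain it. The function names (`kmer_coverage`, `has_perfect_coverage`) and the toy genome and reads are assumptions chosen for illustration, and the sketch ignores sequencing errors. It also treats coverage as plain string containment, which is a simplification when a k-mer repeats in the genome.

```python
def kmer_coverage(genome, reads, k):
    """Map each k-mer of the genome to the number of reads that contain it."""
    coverage = {}
    for i in range(len(genome) - k + 1):
        kmer = genome[i:i + k]
        # Coverage of a k-mer: the number of reads to which it belongs.
        coverage[kmer] = sum(1 for read in reads if kmer in read)
    return coverage


def has_perfect_coverage(genome, reads, k):
    """True if every k-mer of the genome appears in at least one read."""
    return all(count > 0 for count in kmer_coverage(genome, reads, k).values())


# Hypothetical toy data: a 20-nucleotide genome and four error-free 10-mer reads.
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

coverage = kmer_coverage(genome, reads, 10)
missing = [kmer for kmer, count in coverage.items() if count == 0]
print(len(coverage), "10-mers in the genome;", len(missing), "have zero coverage")
```

With real data the genome is unknown, so coverage cannot be measured this way; the sketch only illustrates what imperfect coverage means when the genome is given.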

STOP and Think: Given a set of reads having imperfect k-mer coverage, can you find a parameter $l < k$ so that the same reads have perfect l-mer coverage? What is the maximum value of this parameter?

Figure 3.37 (left) shows four 10-mer reads that capture some but not all of the 10-mers in an example genome. However, if we take the counterintuitive step of breaking these reads into shorter 5-mers (Figure 3.37, right), then these 5-mers exhibit perfect coverage. This read-breaking approach, in which we break reads into shorter k-mers, is used by many modern assemblers.
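The read-breaking idea can be sketched in a few lines of Python. The helpers below (`break_reads`, `genome_kmers`, `max_perfect_l`) and the toy genome and reads are assumptions made for illustration and are not necessarily the data shown in the figure. Checking coverage against a known genome is possible only in a toy setting like this one, since a real assembler doesn't know the genome in advance.

```python
def break_reads(reads, k):
    """Break every read into its k-mers and return the set of k-mers found."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def genome_kmers(genome, k):
    """Return the set of all k-mers occurring in the genome."""
    return {genome[i:i + k] for i in range(len(genome) - k + 1)}


def max_perfect_l(genome, reads):
    """Largest l for which the broken reads contain every l-mer of the genome."""
    shortest_read = min(len(read) for read in reads)
    # Perfect l-mer coverage implies perfect (l-1)-mer coverage,
    # so the first success when scanning downward is the maximum.
    for l in range(shortest_read, 0, -1):
        if genome_kmers(genome, l) <= break_reads(reads, l):
            return l
    return 0


# Hypothetical toy data: a 20-nucleotide genome and four error-free 10-mer reads.
genome = "ATGCCGTATGGACAACGACT"
reads = ["ATGCCGTATG", "GCCGTATGGA", "GTATGGACAA", "GACAACGACT"]

print(genome_kmers(genome, 10) <= break_reads(reads, 10))  # imperfect 10-mer coverage
print(genome_kmers(genome, 5) <= break_reads(reads, 5))    # perfect 5-mer coverage
print(max_perfect_l(genome, reads))
```

For this toy data, the 10-mers are not perfectly covered but the 5-mers are. The sketch assumes error-free reads, which real sequencing data does not provide.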
