Assembling Genomes: From Reads to Read-Pairs

Let’s focus on the transformation of reads to read-pairs.

Previously, we described an idealized form of genome assembly in order to build up your intuition about de Bruijn graphs. In the rest of the chapter, we’ll discuss a number of practically motivated topics that will help you appreciate the advanced methods used by modern assemblers.

Reads in genomes

We’ve already mentioned that assembling reads sampled from a randomly generated text is a trivial problem, since random strings are not expected to have long repeats. Moreover, de Bruijn graphs become less and less tangled as read length increases (figure below). As soon as read length exceeds the length of all repeats in a genome (provided the reads have no errors), the de Bruijn graph turns into a path. However, despite many attempts, biologists have not yet figured out how to generate reads that are both long and accurate. The most accurate sequencing technologies available today generate reads that are only about 300 nucleotides long, which is too short to span most repeats, even in short bacterial genomes.
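To make the untangling effect concrete, here is a minimal Python sketch that lists the branching nodes of a de Bruijn graph for growing values of k. It assumes error-free k-mers taken from the short example string TAATGCCATGGGATGTT; the helper name `branching_nodes` is hypothetical and introduced only for illustration. In this construction nodes are (k-1)-mers and edges are distinct k-mers, so branching disappears once k - 1 exceeds the length of the longest repeated substring.

```python
from collections import defaultdict

def branching_nodes(text, k):
    """List the (k-1)-mer nodes with more than one distinct incoming or outgoing k-mer edge."""
    out_edges = defaultdict(set)
    in_edges = defaultdict(set)
    for i in range(len(text) - k + 1):
        kmer = text[i:i + k]
        out_edges[kmer[:-1]].add(kmer)  # edge leaves its (k-1)-mer prefix
        in_edges[kmer[1:]].add(kmer)    # edge enters its (k-1)-mer suffix
    nodes = set(out_edges) | set(in_edges)
    return sorted(
        node for node in nodes
        if len(out_edges.get(node, ())) > 1 or len(in_edges.get(node, ())) > 1
    )

text = "TAATGCCATGGGATGTT"  # example string; its longest repeat is ATG
for k in (3, 4, 5):
    print(k, branching_nodes(text, k))
```

For this string the list shrinks from three branching nodes at k = 3 to one at k = 4 and none at k = 5, at which point the graph is a single path.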

We saw earlier that the string TAATGCCATGGGATGTT can’t be uniquely reconstructed from its 3-mer composition since another string (TAATGGGATGCCATGTT) has the same 3-mer composition.
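The claim is easy to check directly. The following Python sketch compares the 3-mer compositions of the two strings as multisets; the function name `kmer_composition` is an illustrative choice, not something defined earlier in the chapter.

```python
from collections import Counter

def kmer_composition(text, k):
    """Return the multiset of all k-mers occurring in text."""
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))

first = "TAATGCCATGGGATGTT"
second = "TAATGGGATGCCATGTT"

# The strings differ, yet their 3-mer multisets are identical.
print(first != second)
print(kmer_composition(first, 3) == kmer_composition(second, 3))
```

Both lines print True: the two strings differ only in the order of the segments between occurrences of ATG, which 3-mers alone cannot resolve.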

STOP and Think: What additional experimental information would allow you to uniquely reconstruct the string TAATGCCATGGGATGTT?
