Genome assembly is more difficult than you think

Before we introduce a computational problem modeling genome assembly, we’ll take a moment to discuss a few practical complications that make genome assembly more difficult than the Newspaper Problem.

First, DNA is double-stranded, and we have no way of knowing a priori which strand a given read derives from, meaning that we’ll not know whether to use a read or its reverse complement when assembling a particular strand of a genome.
Second, modern sequencing machines are not perfect, and the reads that they generate often contain errors. Sequencing errors complicate genome assembly because they prevent us from identifying all overlapping reads.
Third, some regions of the genome may not be covered by any reads, making it impossible to reconstruct the entire genome.

Since the reads generated by modern sequencers often have the same length, we may safely assume that reads are all k-mers for some value of k. The first part of this chapter will assume an ideal — and unrealistic — situation in which all reads come from the same strand, have no errors, and exhibit perfect coverage, so that every k-mer substring of the genome is generated as a read. Later, we’ll show how to relax these assumptions for more realistic datasets.

Reconstructing strings from k-mers

We’re now ready to define a computational problem modeling genome assembly. Given a string Text, its k-mer composition Composition $_k$ (Text) is the collection of all k-mer substrings of Text (including repeated k-mers). For example:

Composition $_{3}$ (TATGGGGTGC) = {ATG,GGG,GGG,GGT,GTG,TAT,TGC,TGG}.

Note that we’ve listed k-mers in lexicographic order (i.e., how they would appear in a dictionary) rather than in the order of their appearance in TATGGGGTGC. We’ve done this because the correct ordering of reads is unknown when they’re generated.

Press + to interact

Solution explanation

Line 5: We define kmers as an empty array.
Lines 6–7: We iterate over text.
- Line 7: We add every k-mer with length k using the i:i+k to kmers.

Solving the String Composition Problem is a straightforward exercise, but in order to model genome assembly, we need to solve its inverse problem.

String Reconstruction Problem Problem overview:

Reconstruct a string from its k-mer composition.

Input: An integer k and a collection Patterns of k-mers.

Output: A string Text with k-mer composition equal to Patterns (if such a string exists).

Before we ask you to solve the String Reconstruction Problem, let’s consider the following example of a 3-mer composition:

AAT ATG GTT TAA TGT

The most natural way to solve the String Reconstruction Problem is to mimic the solution of the Newspaper Problem and “connect” a pair of k-mers if they overlap in k − 1 symbols.

For the above example, it’s easy to see that the string should start with TAA because there’s no 3-mer ending in TA. This implies that the next 3-mer in the string should start with AA. There’s only one 3-mer satisfying this condition, AAT:

TAA
$\space$ $\space$ AAT

In turn, AAT can only be extended by ATG, which can only be extended by TGT, and so on, leading us to reconstruct TAATGTT:

TAA
$\space$ $\space$ AAT
$\space$ $\space$ $\space$ $\space$ $\space$ ATG
$\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ TGT
$\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ $\space$ GTT
TAATGTT

It looks like we’re finished with the String Reconstruction Problem and can let you move on to the next lesson. To be sure, let’s consider another 3-mer composition:

AAT $\space$ ATG $\space$ ATG $\space$ ATG $\space$ CAT $\space$ CCA $\space$ GAT $\space$ GCC $\space$ GGA $\space$ GGG $\space$ GTT $\space$ TAA $\space$ TGC $\space$ TGG $\space$ TGT

Exercise Break: Reconstruct a string with this 3-mer composition.

If we start again with TAA, then the next 3-mer in the string should start with AA, and there’s only one such 3-mer, AAT. In turn, AAT can only be extended by ATG:

TAA
$\space$ $\space$ AAT
$\space$ $\space$ $\space$ $\space$ $\space$ ATG
TAATG

ATG can be extended either by TGC, or TGG, or TGT. Now we must decide which of these 3-mers to choose. Let’s select TGT:

TAA
$\space$ $\space$ AAT
$\space$ $\space$ $\space$ $\space$ $\space$ ATG
$\space$ ...

Before Getting Started

Where in the Genome Does DNA Replication Begin?

DNA Replication: Open Problems, Charging Stations, and Detours

How Do We Assemble Genomes?

Assemble Genomes: Charging Stations, and Detours

How Do We Compare Biological Sequences?

Biological Sequences: Detours

Conclusion

The String Reconstruction Problem

Genome assembly is more difficult than you think

Reconstructing strings from k-mers

Solution explanation