...

/

Space Efficient Sequence Alignment

Space Efficient Sequence Alignment

Let’s look at a space efficient solution to sequence alignment.

To introduce fitting alignments, we used the example of aligning a 20,000 amino acidlong NRP synthetase from Bacillus brevis against a 600 amino acid-long A-domain from Streptomyces roseosporus. However, you may not be able to construct this alignment on your computer because the memory required to store the dynamic programming matrix is substantial.

Computing alignment score using linear memory

The runtime of the dynamic programming algorithm for aligning two strings of lengths n and m is proportional to the number of edges in their alignment graph, which is O(nm)O(n · m). The memory required by this algorithm is also O(nm)O(n · m), since we need to store the backtracking references. We’ll now demonstrate how to construct an alignment in just O(n)O(n) space at the expense of doubling the runtime, meaning that the runtime is still O(nm)O(n · m). In this section, for the sake of simplicity, we’ll consider the case of global alignment with non-affine gap penalties, i.e., when each symbol in a gap contributes the same score.

Divide-and-conquer algorithm

A divide-and-conquer algorithm often works when a solution to a large problem can be constructed from solutions of smaller problem instances. Such a strategy proceeds in two phases. The divide phase splits a problem instance into smaller instances and solves them; the conquer phase stitches the smaller solutions into a solution to the original problem. To learn more, see DETOUR: Divide-and-Conquer Algorithms.

Before we proceed with the divide-and-conquer algorithm for linear-space alignment, note that if we only wish to compute the score of an alignment rather than the alignment itself, then the space required can easily be reduced to just twice the number of nodes in a single column of the alignment graph, or O(n)O(n). This reduction derives from the observation that the only values needed to compute the alignment scores in column j are the alignment scores in column j1j - 1. Therefore, the alignment scores in the columns before column j1j - 1 can be discarded when computing the alignment scores for column j, as illustrated in the figure below. Unfortunately, finding the longest path requires us to store an entire matrix of backtracking pointers, which causes the quadratic space requirement. The idea of the space reduction technique is that we don’t need to store any backtracking pointers if we’re willing to spend a little more time. In fact, we’ll show how to construct an alignment without storing any backtracking pointers.

The Middle Node Problem

Given strings v{v} = v1{v}_{1}vn{v}_{n} and w{w} = w1{w}_{1}wm{w}_{m} ...