...

/

Epilogue: Multiple Sequence Alignment

Epilogue: Multiple Sequence Alignment

Learn multiple sequence alignment with the help of a three-dimensional Manhattan and a greedy multiple alignment algorithm.

Amino acid sequences of proteins performing the same function are likely to be somewhat similar, but these similarities may be elusive in the case of distant species. You now possess an arsenal of algorithms for aligning pairs of sequences, but if sequence similarity is weak, pairwise alignment may not identify biologically related sequences. However, simultaneous comparison of many sequences often allows us to find similarities that pairwise sequence comparison fails to reveal. Bioinformaticians sometimes say that while pairwise alignment whispers, multiple alignment shouts.

Building a three-dimensional Manhattan

We’re now ready to use pairwise sequence analysis to build up our intuition for comparison of multiple sequences. In our three-way alignment of A-domains from the introduction, we found 19 conserved columns:

However, similarities between A-domains are not limited to these 19 columns, as we can find 10+9+12=3110 + 9 + 12 = 31 semi-conservative columns, each of which has two matching amino acids:

T-way alignment

A multiple alignment of strings v1{v}^{1}, …, vt{v}^{t}, also called a t-way alignment, is specified by a matrix having t rows, where the i-th row contains the symbols of vi{v}_{i} in order, interspersed with space symbols. We also assume that no column in multiple alignments contains only space symbols. In the 3-way alignment below, we’ve highlighted the most popular symbol in each column using upper case letters:

The multiple alignment matrix is a generalization of the pairwise alignment matrix to more than two sequences. The three arrays shown below this alignment record the respective number of symbols in ATGTTATA, AGCGATCA, and ATCGTCTC encountered up to a given position. Together, these three arrays correspond to a path in a three-dimensional grid:

As the alignment graph for two sequences is a grid of squares, the alignment graph for three sequences is a grid of cubes. Every node in the 3-way alignment graph has up to seven incoming edges, as shown in the figure below.

The score of multiple alignments is defined as the sum of scores of the alignment columns (or, equivalently, weights of edges in the alignment path), with an optimal alignment being one that maximizes this score. In the case of an amino acid alphabet, we can use a very general scoring method that is defined by a t-dimensional matrix containing 21t21^{t} entries that describe the scores of all possible combinations of t symbols (representing 20 amino acids and the space symbol). Intuitively, we should reward more conserved columns with higher scores. For more details, see DETOUR: Scoring Multiple Alignments. Intuitively, we should reward more conserved columns with higher scores. For example, in the Multiple Longest Common Subsequence Problem, the score of a column is equal to 1 if all of the column’s symbols are identical, and 0 if even one symbol disagrees.

Multiple Alignment Problem

Problem overview:
Find the highest-scoring alignment between multiple strings under a given scoring matrix.

Input: A collection of t strings and a t-dimensional matrix Score.
Output: A multiple alignment of these strings whose score (as defined by the matrix Score) is maximized among all possible alignments of these strings.

Sample input:

PLEASANTLY

MEANLY

HAPPILY

Output:

3

PLE—ASANT----LY

—ME-A----N—LY

-----HA-----PPILY

Press + to interact
import itertools
MATCH = 1
MISMATCH = 0
GAP = 0
def MultipleAlignment(dna):
def go():
g = itertools.product([0, -1], repeat=k) #store the cartesion product of 0 and -1 taken k times in g
next(g) # get the next element of g
return g
k = len(dna) # store the length of dna in k
score = {} # make an map score
dir = {} # make a map dir
cells = itertools.product(*[range(len(d) + 1) for d in dna]) # store the cartesion product of range of len(d)+1 where d is iterator for dna
start = next(cells) # store the next of cells in start
score[start] = 0 # initialize the score[start]=0
dir[start] = None # initialize the dir[start]=None
for c in cells: # run loop on cells
score[c] = -10 ** 6 #initialize the score[c] with -10 ** 6
dir[c] = None # initialize the dir[c]=None
for d in go(): # run loop on go()
prev = tuple(map(lambda x, y: x + y, c, d)) # evaluate the function using lambda and make a list using map and than pass to tuple and store in prev
if any(x < 0 for x in prev): continue # check if any value in prev is 0 than continue
if d.count(0): # if at least one '-', then assign GAP to penalty
penalty = GAP
elif any(dna[i][prev[i]] != dna[0][prev[0]] for i in range(k)): # if unequal column, then assign MISMATCH to penalty
penalty = MISMATCH
else: # if all are equal in column, then assign MATCH to penalty
penalty = MATCH
if score[c] < score[prev] + penalty: # check if score[c] < score[prev] + penalty, then score[c] = score[prev] + penalty and dir[c] = d
score[c] = score[prev] + penalty
dir[c] = d
c = tuple(len(d) for d in dna) # assign tuple of len(d) where d is iterator for dna to c
final_score = score[c] # assign score[c] to final_score
alignment = ['' for _ in dna] # assign '' to alignment
d = dir[c] # assign dir[c] to d
#we don't need actual alignment for scoring, but let's find it
while d:
c = tuple(map(lambda x, y: x + y, c, d)) # evaluate the function using lambda and make a list using map and than pass to tuple and store in c
for i, g in enumerate(d): # run loop on d
if not g: # check if there is no g, then alignment[i] += '-'
alignment[i] += '-'
else: # else alignment[i] += dna[i][c[i]]
alignment[i] += dna[i][c[i]]
d = dir[c] # assign dir[c] to d
return '%d\n%s' % (final_score, '\n'.join(x[::-1] for x in alignment))
DNA = """AATATCCG
TCCGA
ATGTACTG""".splitlines()
print(MultipleAlignment(DNA))

A straightforward dynamic programming algorithm applied to the t-dimensional alignment graph solves the Multiple Alignment Problem for t strings. For three sequences v,w,{v, w,} and ...