Epilogue: Multiple Sequence Alignment
Learn multiple sequence alignment with the help of a three-dimensional Manhattan and a greedy multiple alignment algorithm.
We'll cover the following...
Amino acid sequences of proteins performing the same function are likely to be somewhat similar, but these similarities may be elusive in the case of distant species. You now possess an arsenal of algorithms for aligning pairs of sequences, but if sequence similarity is weak, pairwise alignment may not identify biologically related sequences. However, simultaneous comparison of many sequences often allows us to find similarities that pairwise sequence comparison fails to reveal. Bioinformaticians sometimes say that while pairwise alignment whispers, multiple alignment shouts.
Building a three-dimensional Manhattan
We’re now ready to use pairwise sequence analysis to build up our intuition for comparison of multiple sequences. In our three-way alignment of A-domains from the introduction, we found 19 conserved columns:
However, similarities between A-domains are not limited to these 19 columns, as we can find semi-conservative columns, each of which has two matching amino acids:
T-way alignment
A multiple alignment of strings , …, , also called a t-way alignment, is specified by a matrix having t rows, where the i-th row contains the symbols of in order, interspersed with space symbols. We also assume that no column in multiple alignments contains only space symbols. In the 3-way alignment below, we’ve highlighted the most popular symbol in each column using upper case letters:
The multiple alignment matrix is a generalization of the pairwise alignment matrix to more than two sequences. The three arrays shown below this alignment record the respective number of symbols in ATGTTATA, AGCGATCA, and ATCGTCTC encountered up to a given position. Together, these three arrays correspond to a path in a three-dimensional grid:
As the alignment graph for two sequences is a grid of squares, the alignment graph for three sequences is a grid of cubes. Every node in the 3-way alignment graph has up to seven incoming edges, as shown in the figure below.
The score of multiple alignments is defined as the sum of scores of the alignment columns (or, equivalently, weights of edges in the alignment path), with an optimal alignment being one that maximizes this score. In the case of an amino acid alphabet, we can use a very general scoring method that is defined by a t-dimensional matrix containing entries that describe the scores of all possible combinations of t symbols (representing 20 amino acids and the space symbol). Intuitively, we should reward more conserved columns with higher scores. For more details, see DETOUR: Scoring Multiple Alignments. Intuitively, we should reward more conserved columns with higher scores. For example, in the Multiple Longest Common Subsequence Problem, the score of a column is equal to 1 if all of the column’s symbols are identical, and 0 if even one symbol disagrees.
Multiple Alignment Problem
Problem overview:
Find the highest-scoring alignment between multiple strings under a given scoring matrix.
Input: A collection of t strings and a t-dimensional matrix Score.
Output: A multiple alignment of these strings whose score (as defined by the matrix Score) is maximized among all possible alignments of these strings.
Sample input:
PLEASANTLY
MEANLY
HAPPILY
Output:
3
PLE—ASANT----LY
—ME-A----N—LY
-----HA-----PPILY
import itertoolsMATCH = 1MISMATCH = 0GAP = 0def MultipleAlignment(dna):def go():g = itertools.product([0, -1], repeat=k) #store the cartesion product of 0 and -1 taken k times in gnext(g) # get the next element of greturn gk = len(dna) # store the length of dna in kscore = {} # make an map scoredir = {} # make a map dircells = itertools.product(*[range(len(d) + 1) for d in dna]) # store the cartesion product of range of len(d)+1 where d is iterator for dnastart = next(cells) # store the next of cells in startscore[start] = 0 # initialize the score[start]=0dir[start] = None # initialize the dir[start]=Nonefor c in cells: # run loop on cellsscore[c] = -10 ** 6 #initialize the score[c] with -10 ** 6dir[c] = None # initialize the dir[c]=Nonefor d in go(): # run loop on go()prev = tuple(map(lambda x, y: x + y, c, d)) # evaluate the function using lambda and make a list using map and than pass to tuple and store in previf any(x < 0 for x in prev): continue # check if any value in prev is 0 than continueif d.count(0): # if at least one '-', then assign GAP to penaltypenalty = GAPelif any(dna[i][prev[i]] != dna[0][prev[0]] for i in range(k)): # if unequal column, then assign MISMATCH to penaltypenalty = MISMATCHelse: # if all are equal in column, then assign MATCH to penaltypenalty = MATCHif score[c] < score[prev] + penalty: # check if score[c] < score[prev] + penalty, then score[c] = score[prev] + penalty and dir[c] = dscore[c] = score[prev] + penaltydir[c] = dc = tuple(len(d) for d in dna) # assign tuple of len(d) where d is iterator for dna to cfinal_score = score[c] # assign score[c] to final_scorealignment = ['' for _ in dna] # assign '' to alignmentd = dir[c] # assign dir[c] to d#we don't need actual alignment for scoring, but let's find itwhile d:c = tuple(map(lambda x, y: x + y, c, d)) # evaluate the function using lambda and make a list using map and than pass to tuple and store in cfor i, g in enumerate(d): # run loop on dif not g: # check if there is no g, then alignment[i] += '-'alignment[i] += '-'else: # else alignment[i] += dna[i][c[i]]alignment[i] += dna[i][c[i]]d = dir[c] # assign dir[c] to dreturn '%d\n%s' % (final_score, '\n'.join(x[::-1] for x in alignment))DNA = """AATATCCGTCCGAATGTACTG""".splitlines()print(MultipleAlignment(DNA))
A straightforward dynamic programming algorithm applied to the t-dimensional alignment graph solves the Multiple Alignment Problem for t strings. For three sequences and ...