Penalizing Insertions and Deletions in Sequence Alignment
Apply affine gap penalties to the sequence alignment problem.
Affine gap penalties
We’ve seen that introducing mismatch and indel penalties can produce more biologically adequate global alignments. However, even with this more robust scoring model, the A-domain alignment that we previously constructed (with indel penalty = 4) still reveals only six of the eight conserved purple columns corresponding to the non-ribosomal signatures:
STOP and Think: Would increasing the indel penalty from = 4 to = 10 reveal the biologically correct alignment?
Inadequate scoring for biological sequences
In our previously defined linear scoring model, if is the penalty for the insertion or deletion of a single symbol, then · k is the penalty for the insertion or deletion of an interval of k symbols. This cost model, unfortunately, results in inadequate scoring for biological sequences. Mutations are often caused by errors in DNA replication that insert or delete an entire interval of k nucleotides as a single event instead of as k independent insertions or deletions. Thus, penalizing such an indel by · k represents an excessive penalty. For example, the alignment on the right is more adequate than the alignment on the left, but they would currently receive the same score.
Affine gap penalties
A gap is a contiguous sequence of spaces in a row of an alignment. One way to score gaps more appropriately is to define an affine penalty for a gap of length k as , where is the gap opening penalty, assessed to the first symbol in the gap, and is the gap extension penalty, assessed to each additional symbol in the gap. We select to be smaller than $σ so that the affine penalty for a gap of length k is smaller than the penalty for k independent single-nucleotide indels (σ · k). For example, with affine gap penalties, the alignment on the left above is penalized by , whereas the alignment on the right is only penalized by .
Alignment with Affine Gap Penalties Problem
Problem overview:
Construct a highest-scoring global alignment between two strings (with affine gap penalties).
Input: Two strings, a scoring matrix Score, and numbers σ and ε.
Output: A highest-scoring global alignment between these strings, as defined by the scoring matrix score and by the gap opening and extension penalties σ and ε.
Sample dataset:
PRTEINS
PRTWPSEIN
Sample output:
8
PRT—EINS
PRTWPSEIN-
#include <iostream>#include <string>#include <utility>#include <map>#include <vector>#include <algorithm>using namespace std;const int INF = 1000000;const char this_ = 0;const char up_ = 1;const char down_ = 2;const char zero_ = 3; // new local alignmentstruct AlignmentResult{int score;string s1;string s2;};AlignmentResult alignment(string s1, string s2, int gapop, int gapex, map<char, map<char, int> > scoring_matrix) {/* Total gap penalty is gapop + gapex * (L - 1) where L is the length of the gap */gapop -= gapex;vector<vector<vector<char> > > p(3, vector<vector<char> >(s1.size() + 1, vector<char>(s2.size() + 1))); // parentvector<vector<vector<int> > > d(3, vector<vector<int> >(s1.size() + 1, vector<int>(s2.size() + 1))); // distance// 0 - upper matrix (gaps)// 1 - usual matrix// 2 - lower matrix (gaps)// base:for (size_t i = 0; i <= s1.size(); ++i) {d[2][i][0] = -gapop - i * gapex;p[2][i][0] = this_;d[1][i][0] = -gapop - i * gapex;p[1][i][0] = down_;d[0][i][0] = -INF;p[0][i][0] = zero_;}for (size_t j = 0; j <= s2.size(); ++j) {d[0][0][j] = -gapop - j * gapex;p[0][0][j] = this_;d[1][0][j] = -gapop - j * gapex;p[1][0][j] = up_;d[2][0][j] = -INF;p[2][0][j] = zero_;}d[0][0][0] = d[1][0][0] = d[2][0][0] = 0;p[0][0][0] = p[1][0][0] = p[2][0][0] = zero_;// dynamic:for (size_t i = 1; i <= s1.size(); ++i) {for (size_t j = 1; j <= s2.size(); ++j) {int up, down, ths;int zero = 0;// upperths = d[0][i][j-1] - gapex;down = d[1][i][j-1] - gapop - gapex;if (ths > down) {d[0][i][j] = ths;p[0][i][j] = this_;}else {d[0][i][j] = down;p[0][i][j] = down_;}// lowerths = d[2][i-1][j] - gapex;up = d[1][i-1][j] - gapop - gapex;if (ths > up) {d[2][i][j] = ths;p[2][i][j] = this_;}else {d[2][i][j] = up;p[2][i][j] = up_;}// middleths = d[1][i-1][j-1] + scoring_matrix[s1[i-1]][s2[j-1]];up = d[0][i][j];down = d[2][i][j];if (ths > up && ths > down ) {d[1][i][j] = ths;p[1][i][j] = this_;}else if (up > down) {d[1][i][j] = up;p[1][i][j] = up_;}else {d[1][i][j] = down;p[1][i][j] = down_;}}}// output distanceint maxscore;int maxk;int maxi;int maxj;maxi = s1.size();maxj = s2.size();maxk = 1;maxscore = d[maxk][maxi][maxj];for (int k = 0; k < 3; ++k) {if (d[k][maxi][maxj] > maxscore) {maxscore = d[k][maxi][maxj];maxk = k;}}string res1, res2;int k = maxk;int i = maxi;int j = maxj;while (i >= 0 && j >= 0 && k >= 0 && k <= 2) {if (p[k][i][j] == this_ && k == 1) { // diagonali -= 1;j -= 1;res1 += s1[i];res2 += s2[j];}else if (p[k][i][j] == this_ && k == 0) { // horisontal gapj -= 1;res1 += '-';res2 += s2[j];}else if (p[k][i][j] == this_ && k == 2) { // vertical gapi -= 1;res1 += s1[i];res2 += '-';}else if (p[k][i][j] == down_ && k == 0) { // gap opening/closing, 1->0k = 1;j -= 1;res1 += '-';res2 += s2[j];}else if (p[k][i][j] == down_ && k == 1) { // gap opening/closing, 2->1k = 2;}else if (p[k][i][j] == up_ && k == 2) { // gap opening/closing, 1->2k = 1;i -= 1;res1 += s1[i];res2 += '-';}else if (p[k][i][j] == up_ && k == 1) { // gap opening/closing, 0->1k = 0;}else {break;}}reverse(res1.begin(), res1.end());reverse(res2.begin(), res2.end());AlignmentResult ans;ans.score = maxscore;ans.s1 = res1;ans.s2 = res2;return ans;}int main() {map<char, map <char,int> > blosum62 = {{'W', {{'F', 1}, {'R', -3}, {'A', -3}, {'B', -4}, {'Z', -3}, {'E', -3}, {'M', -1}, {'I', -3}, {'Q', -2}, {'X', -2}, {'D', -4}, {'L', -2}, {'H', -2}, {'T', -2}, {'P', -4}, {'Y', 2}, {'V', -3}, {'W', 11}, {'G', -2}, {'C', -2}, {'K', -3}, {'N', -4}, {'S', -3}}},{'F', {{'W', 1}, {'T', -2}, {'Q', -3}, {'A', -2}, {'E', -3}, {'I', 0}, {'M', 0}, {'Y', 3}, {'V', -1}, {'R', -3}, {'F', 6}, {'N', -3}, {'S', -2}, {'P', -4}, {'X', -1}, {'C', -2}, {'G', -3}, {'K', -3}, {'Z', -3}, {'D', -3}, {'H', -1}, {'B', -3}, {'L', 0}}},{'L', {{'R', -2}, {'N', -3}, {'P', -3}, {'X', -1}, {'Q', -2}, {'A', -1}, {'E', -3}, {'V', 1}, {'K', -2}, {'Z', -3}, {'B', -4}, {'S', -2}, {'W', -2}, {'D', -4}, {'H', -3}, {'L', 4}, {'T', -1}, {'M', 2}, {'C', -1}, {'G', -4}, {'Y', -1}, {'I', 2}, {'F', 0}}},{'R', {{'L', -2}, {'W', -3}, {'G', -2}, {'M', -1}, {'T', -1}, {'D', -2}, {'Q', 1}, {'N', 0}, {'K', 2}, {'Y', -2}, {'V', -3},{'F', -3}, {'S', -1}, {'H', 0}, {'P', -2}, {'A', -1}, {'I', -3}, {'X', -1}, {'E', 0}, {'C', -3}, {'R', 5}, {'B', -1}, {'Z', 0}}},{'S', {{'P', -1}, {'D', 0}, {'H', -1}, {'B', 0}, {'V', -2}, {'S', 4}, {'C', -1}, {'G', 0}, {'K', 0}, {'L', -2}, {'X', 0}, {'R', -1}, {'F', -2}, {'N', 1}, {'Z', 0}, {'Y', -2}, {'Q', 0}, {'A', 1}, {'W', -3}, {'E', 0}, {'I', -2}, {'M', -1}, {'T', 1}}},{'P', {{'S', -1}, {'P', 7}, {'Z', -1}, {'D', -1}, {'L', -3}, {'H', -2}, {'T', -1}, {'B', -2}, {'G', -2}, {'C', -3}, {'K', -1}, {'W', -4}, {'R', -2}, {'F', -4}, {'N', -2}, {'X', -2}, {'Y', -3}, {'Q', -1}, {'V', -2}, {'E', -1}, {'A', -1}, {'M', -2}, {'I', -3}}},{'V', {{'T', 0}, {'D', -3}, {'A', 0}, {'E', -2}, {'I', 3}, {'S', -2}, {'M', 1}, {'Q', -2}, {'Z', -2}, {'L', 1}, {'F', -1}, {'N', -3}, {'R', -3},{ 'V', 4}, {'Y', -1}, {'B', -3}, {'C', -1}, {'G', -3}, {'K', -2}, {'W', -3}, {'X', -1}, {'H', -3}, {'P', -2}}},{'T', {{'V', 0}, {'N', 0}, {'F', -2}, {'R', -1}, {'M', -1}, {'X', 0}, {'I', -1}, {'A', 0}, {'P', -1}, {'E', -1}, {'Q', -1}, {'W', -2}, {'T', 5}, {'B', -1}, {'H', -2}, {'L', -1}, {'D', -1}, {'Y', -2}, {'K', -1}, {'C', -1}, {'G', -2}, {'Z', -1}, {'S', 1}}},{'Q', {{'Q', 5}, {'A', -1}, {'Y', -1}, {'V', -2}, {'F', -3}, {'L', -2}, {'R', 1}, {'C', -3}, {'W', -2}, {'N', 0}, {'G', -2}, {'M', 0}, {'T', -1}, {'D', 0}, {'E', 2}, {'K', 1}, {'B', 0}, {'Z', 3}, {'S', 0}, {'H', 0}, {'P', -1}, {'I', -3}, {'X', -1}}},{'N', {{'A', -2}, {'L', -3}, {'G', 0}, {'T', 0}, {'M', -2}, {'D', 1}, {'N', 6}, {'Q', 0}, {'R', 0}, {'Y', -2}, {'V', -3}, {'F', -3}, {'H', 1}, {'S', 1}, {'C', -3}, {'I', -3}, {'P', -2}, {'X', -1}, {'W', -4}, {'E', 0}, {'K', 0}, {'Z', 0}, {'B', 3}}},{'A', {{'N', -2}, {'Q', -1}, {'W', -3}, {'Y', -2}, {'V', 0}, {'F', -2}, {'L', -1}, {'C', 0}, {'G', 0}, {'T', 0}, {'M', -1}, {'D', -2}, {'E', -1}, {'R', -1}, {'K', -1}, {'Z', -1}, {'B', -2}, {'S', 1}, {'H', -2}, {'A', 4}, {'P', -1}, {'I', -1}, {'X', 0}}},{'Z', {{'Y', -2}, {'Z', 4}, {'P', -1}, {'G', -2}, {'C', -3}, {'K', 1}, {'W', -3}, {'I', -3}, {'V', -2}, {'D', 1}, {'L', -3}, {'H', 0}, {'S', 0}, {'E', 4}, {'A', -1}, {'M', -1}, {'Q', 3}, {'X', -1}, {'F', -3}, {'T', -1}, {'B', 1}, {'N', 0}, {'R', 0}}},{'Y', {{'Z', -2}, {'M', -1}, {'I', -1}, {'E', -2}, {'B', -3}, {'A', -2}, {'Y', 7}, {'Q', -1}, {'N', -2}, {'F', 3}, {'R', -2}, {'V', -1}, {'K', -2}, {'G', -3}, {'C', -2}, {'W', 2}, {'S', -2}, {'L', -1}, {'H', 2}, {'D', -3}, {'T', -2}, {'P', -3}, {'X', -1}}},{'D', {{'S', 0}, {'H', -1}, {'V', -3}, {'P', -1}, {'I', -3}, {'R', -2}, {'X', -1}, {'N', 1}, {'E', 2}, {'C', -3}, {'K', -1}, {'Z', 1}, {'Q', 0}, {'B', 4}, {'A', -2}, {'W', -4}, {'L', -4}, {'G', -1}, {'T', -1}, {'M', -3}, {'D', 6}, {'Y', -3}, {'F', -3}}},{'H', {{'H', 8}, {'S', -1}, {'D', -1}, {'I', -3}, {'P', -2}, {'X', -1}, {'G', -2}, {'C', -3}, {'K', -1}, {'Z', 0}, {'B', 0}, {'R', 0}, {'W', -2}, {'N', 1}, {'L', -3}, {'T', -2}, {'M', -2}, {'Q', 0}, {'E', 0}, {'A', -2}, {'Y', 2}, {'V', -3}, {'F', -1}}},{'M', {{'Y', -1}, {'R', -1}, {'V', 1}, {'N', -2}, {'T', -1}, {'F', 0}, {'W', -1}, {'Q', 0}, {'A', -1}, {'E', -2}, {'I', 1}, {'M', 5}, {'D', -3}, {'H', -2}, {'L', 2}, {'Z', -1}, {'B', -3}, {'S', -1}, {'C', -1}, {'G', -3}, {'P', -2}, {'K', -1}, {'X', -1}}},{'G', {{'R', -2}, {'N', 0}, {'K', -2}, {'Z', -2}, {'B', -1}, {'S', 0}, {'H', -2}, {'Q', -2}, {'E', -2}, {'A', 0}, {'P', -2}, {'I', -4}, {'X', -1}, {'Y', -3}, {'D', -1}, {'V', -3}, {'F', -3}, {'W', -2}, {'L', -4}, {'G', 6}, {'C', -3}, {'T', -2}, {'M', -3}}},{'I', {{'Y', -1}, {'V', 3}, {'H', -3}, {'D', -3}, {'F', 0}, {'Z', -3}, {'W', -3}, {'T', -1}, {'M', 1}, {'G', -4}, {'C', -1}, {'B', -3}, {'R', -3}, {'N', -3}, {'K', -3}, {'S', -2}, {'Q', -3}, {'I', 4}, {'E', -3}, {'A', -1}, {'P', -3}, {'L', 2}, {'X', -1}}},{'E', {{'Y', -2}, {'C', -4}, {'V', -2}, {'F', -3}, {'W', -3}, {'L', -3}, {'G', -2}, {'D', 2}, {'T', -1}, {'M', -2}, {'Q', 2}, {'A', -1}, {'E', 5}, {'K', 1}, {'Z', 4}, {'B', 1}, {'S', 0}, {'H', 0}, {'R', 0}, {'N', 0}, {'P', -1}, {'I', -3}, {'X', -1}}},{'B', {{'Y', -3}, {'S', 0}, {'W', -4}, {'K', 0}, {'G', -1}, {'C', -3}, {'P', -2}, {'L', -4}, {'H', 0}, {'D', 4}, {'I', -3}, {'T', -1}, {'V', -3}, {'Q', 0}, {'X', -1}, {'M', -3}, {'E', 1}, {'A', -2}, {'Z', 1}, {'R', -1}, {'N', 3}, {'F', -3}, {'B', 4}}},{'C', {{'E', -4}, {'C', 9}, {'Z', -3}, {'B', -3}, {'Q', -3}, {'S', -1}, {'H', -3}, {'A', 0}, {'D', -3}, {'P', -3}, {'I', -1}, {'X', -2}, {'Y', -2}, {'N', -3}, {'V', -1}, {'K', -3}, {'F', -2}, {'W', -2}, {'L', -1}, {'G', -3}, {'T', -1}, {'R', -3}, {'M', -1}}},{'K', {{'K', 5}, {'G', -2}, {'Z', 1}, {'B', 0}, {'S', 0}, {'R', 2}, {'H', -1}, {'L', -2}, {'P', -1}, {'D', -1}, {'X', -1}, {'Y', -2}, {'V', -2}, {'Q', 1}, {'I', -3}, {'A', -1}, {'E', 1}, {'C', -3}, {'F', -3}, {'W', -3}, {'T', -1}, {'N', 0}, {'M', -1}}},{'X', {{'L', -1}, {'H', -1}, {'D', -1}, {'X', -1}, {'T', 0}, {'K', -1}, {'G', -1}, {'C', -2}, {'W', -2}, {'S', 0}, {'N', -1}, {'F', -1}, {'B', -1}, {'Z', -1}, {'V', -1}, {'R', -1}, {'P', -2}, {'M', -1}, {'I', -1}, {'E', -1}, {'A', 0}, {'Y', -1}, {'Q', -1}}}};AlignmentResult result = alignment("PRTEINS","PRTWPSEIN",11, 1, blosum62);cout << result.score << endl;cout << result.s1 << endl;cout << result.s2 << endl;return 0;}
The figure below illustrates how affine gap penalties can be modeled in the alignment graph by introducing a new “long” edge for each gap.
Since we don’t know in advance where gaps should be located, we need to add edges accounting for every possible gap. Thus, affine gap penalties can be accommodated by adding all possible vertical and horizontal edges in the alignment graph to represent all possible gaps. Specifically, we add edges connecting to both and with weights for all possible gap sizes k, as illustrated in the figure below. For two sequences of length n, the number of edges in the resulting alignment graph modeling affine gap penalties increases from ...