...

/

Penalizing Insertions and Deletions in Sequence Alignment

Penalizing Insertions and Deletions in Sequence Alignment

Apply affine gap penalties to the sequence alignment problem.

Affine gap penalties

We’ve seen that introducing mismatch and indel penalties can produce more biologically adequate global alignments. However, even with this more robust scoring model, the A-domain alignment that we previously constructed (with indel penalty σσ = 4) still reveals only six of the eight conserved purple columns corresponding to the non-ribosomal signatures:

STOP and Think: Would increasing the indel penalty from σσ = 4 to σσ = 10 reveal the biologically correct alignment?

Inadequate scoring for biological sequences

In our previously defined linear scoring model, if σσ is the penalty for the insertion or deletion of a single symbol, then σσ · k is the penalty for the insertion or deletion of an interval of k symbols. This cost model, unfortunately, results in inadequate scoring for biological sequences. Mutations are often caused by errors in DNA replication that insert or delete an entire interval of k nucleotides as a single event instead of as k independent insertions or deletions. Thus, penalizing such an indel by σσ · k represents an excessive penalty. For example, the alignment on the right is more adequate than the alignment on the left, but they would currently receive the same score.

Affine gap penalties

A gap is a contiguous sequence of spaces in a row of an alignment. One way to score gaps more appropriately is to define an affine penalty for a gap of length k as σ+ε(k1)σ + ε · (k − 1), where σσ is the gap opening penalty, assessed to the first symbol in the gap, and εε is the gap extension penalty, assessed to each additional symbol in the gap. We select εε to be smaller than $σ so that the affine penalty for a gap of length k is smaller than the penalty for k independent single-nucleotide indels (σ · k). For example, with affine gap penalties, the alignment on the left above is penalized by 2σ, whereas the alignment on the right is only penalized by σ+εσ + ε.

Alignment with Affine Gap Penalties Problem

Problem overview:
Construct a highest-scoring global alignment between two strings (with affine gap penalties).

Input: Two strings, a scoring matrix Score, and numbers σ and ε.
Output: A highest-scoring global alignment between these strings, as defined by the scoring matrix score and by the gap opening and extension penalties σ and ε.

Sample dataset:

PRTEINS
PRTWPSEIN

Sample output:

8
PRT—EINS
PRTWPSEIN-

Press + to interact
#include <iostream>
#include <string>
#include <utility>
#include <map>
#include <vector>
#include <algorithm>
using namespace std;
const int INF = 1000000;
const char this_ = 0;
const char up_ = 1;
const char down_ = 2;
const char zero_ = 3; // new local alignment
struct AlignmentResult{
int score;
string s1;
string s2;
};
AlignmentResult alignment(string s1, string s2, int gapop, int gapex, map<char, map<char, int> > scoring_matrix) {
/* Total gap penalty is gapop + gapex * (L - 1) where L is the length of the gap */
gapop -= gapex;
vector<vector<vector<char> > > p(3, vector<vector<char> >(s1.size() + 1, vector<char>(s2.size() + 1))); // parent
vector<vector<vector<int> > > d(3, vector<vector<int> >(s1.size() + 1, vector<int>(s2.size() + 1))); // distance
// 0 - upper matrix (gaps)
// 1 - usual matrix
// 2 - lower matrix (gaps)
// base:
for (size_t i = 0; i <= s1.size(); ++i) {
d[2][i][0] = -gapop - i * gapex;
p[2][i][0] = this_;
d[1][i][0] = -gapop - i * gapex;
p[1][i][0] = down_;
d[0][i][0] = -INF;
p[0][i][0] = zero_;
}
for (size_t j = 0; j <= s2.size(); ++j) {
d[0][0][j] = -gapop - j * gapex;
p[0][0][j] = this_;
d[1][0][j] = -gapop - j * gapex;
p[1][0][j] = up_;
d[2][0][j] = -INF;
p[2][0][j] = zero_;
}
d[0][0][0] = d[1][0][0] = d[2][0][0] = 0;
p[0][0][0] = p[1][0][0] = p[2][0][0] = zero_;
// dynamic:
for (size_t i = 1; i <= s1.size(); ++i) {
for (size_t j = 1; j <= s2.size(); ++j) {
int up, down, ths;
int zero = 0;
// upper
ths = d[0][i][j-1] - gapex;
down = d[1][i][j-1] - gapop - gapex;
if (ths > down) {
d[0][i][j] = ths;
p[0][i][j] = this_;
}
else {
d[0][i][j] = down;
p[0][i][j] = down_;
}
// lower
ths = d[2][i-1][j] - gapex;
up = d[1][i-1][j] - gapop - gapex;
if (ths > up) {
d[2][i][j] = ths;
p[2][i][j] = this_;
}
else {
d[2][i][j] = up;
p[2][i][j] = up_;
}
// middle
ths = d[1][i-1][j-1] + scoring_matrix[s1[i-1]][s2[j-1]];
up = d[0][i][j];
down = d[2][i][j];
if (ths > up && ths > down ) {
d[1][i][j] = ths;
p[1][i][j] = this_;
}
else if (up > down) {
d[1][i][j] = up;
p[1][i][j] = up_;
}
else {
d[1][i][j] = down;
p[1][i][j] = down_;
}
}
}
// output distance
int maxscore;
int maxk;
int maxi;
int maxj;
maxi = s1.size();
maxj = s2.size();
maxk = 1;
maxscore = d[maxk][maxi][maxj];
for (int k = 0; k < 3; ++k) {
if (d[k][maxi][maxj] > maxscore) {
maxscore = d[k][maxi][maxj];
maxk = k;
}
}
string res1, res2;
int k = maxk;
int i = maxi;
int j = maxj;
while (i >= 0 && j >= 0 && k >= 0 && k <= 2) {
if (p[k][i][j] == this_ && k == 1) { // diagonal
i -= 1;
j -= 1;
res1 += s1[i];
res2 += s2[j];
}
else if (p[k][i][j] == this_ && k == 0) { // horisontal gap
j -= 1;
res1 += '-';
res2 += s2[j];
}
else if (p[k][i][j] == this_ && k == 2) { // vertical gap
i -= 1;
res1 += s1[i];
res2 += '-';
}
else if (p[k][i][j] == down_ && k == 0) { // gap opening/closing, 1->0
k = 1;
j -= 1;
res1 += '-';
res2 += s2[j];
}
else if (p[k][i][j] == down_ && k == 1) { // gap opening/closing, 2->1
k = 2;
}
else if (p[k][i][j] == up_ && k == 2) { // gap opening/closing, 1->2
k = 1;
i -= 1;
res1 += s1[i];
res2 += '-';
}
else if (p[k][i][j] == up_ && k == 1) { // gap opening/closing, 0->1
k = 0;
}
else {
break;
}
}
reverse(res1.begin(), res1.end());
reverse(res2.begin(), res2.end());
AlignmentResult ans;
ans.score = maxscore;
ans.s1 = res1;
ans.s2 = res2;
return ans;
}
int main() {
map<char, map <char,int> > blosum62 = {
{'W', {{'F', 1}, {'R', -3}, {'A', -3}, {'B', -4}, {'Z', -3}, {'E', -3}, {'M', -1}, {'I', -3}, {'Q', -2}, {'X', -2}, {'D', -4}, {'L', -2}, {'H', -2}, {'T', -2}, {'P', -4}, {'Y', 2}, {'V', -3}, {'W', 11}, {'G', -2}, {'C', -2}, {'K', -3}, {'N', -4}, {'S', -3}}},
{'F', {{'W', 1}, {'T', -2}, {'Q', -3}, {'A', -2}, {'E', -3}, {'I', 0}, {'M', 0}, {'Y', 3}, {'V', -1}, {'R', -3}, {'F', 6}, {'N', -3}, {'S', -2}, {'P', -4}, {'X', -1}, {'C', -2}, {'G', -3}, {'K', -3}, {'Z', -3}, {'D', -3}, {'H', -1}, {'B', -3}, {'L', 0}}},
{'L', {{'R', -2}, {'N', -3}, {'P', -3}, {'X', -1}, {'Q', -2}, {'A', -1}, {'E', -3}, {'V', 1}, {'K', -2}, {'Z', -3}, {'B', -4}, {'S', -2}, {'W', -2}, {'D', -4}, {'H', -3}, {'L', 4}, {'T', -1}, {'M', 2}, {'C', -1}, {'G', -4}, {'Y', -1}, {'I', 2}, {'F', 0}}},
{'R', {{'L', -2}, {'W', -3}, {'G', -2}, {'M', -1}, {'T', -1}, {'D', -2}, {'Q', 1}, {'N', 0}, {'K', 2}, {'Y', -2}, {'V', -3},{'F', -3}, {'S', -1}, {'H', 0}, {'P', -2}, {'A', -1}, {'I', -3}, {'X', -1}, {'E', 0}, {'C', -3}, {'R', 5}, {'B', -1}, {'Z', 0}}},
{'S', {{'P', -1}, {'D', 0}, {'H', -1}, {'B', 0}, {'V', -2}, {'S', 4}, {'C', -1}, {'G', 0}, {'K', 0}, {'L', -2}, {'X', 0}, {'R', -1}, {'F', -2}, {'N', 1}, {'Z', 0}, {'Y', -2}, {'Q', 0}, {'A', 1}, {'W', -3}, {'E', 0}, {'I', -2}, {'M', -1}, {'T', 1}}},
{'P', {{'S', -1}, {'P', 7}, {'Z', -1}, {'D', -1}, {'L', -3}, {'H', -2}, {'T', -1}, {'B', -2}, {'G', -2}, {'C', -3}, {'K', -1}, {'W', -4}, {'R', -2}, {'F', -4}, {'N', -2}, {'X', -2}, {'Y', -3}, {'Q', -1}, {'V', -2}, {'E', -1}, {'A', -1}, {'M', -2}, {'I', -3}}},
{'V', {{'T', 0}, {'D', -3}, {'A', 0}, {'E', -2}, {'I', 3}, {'S', -2}, {'M', 1}, {'Q', -2}, {'Z', -2}, {'L', 1}, {'F', -1}, {'N', -3}, {'R', -3},{ 'V', 4}, {'Y', -1}, {'B', -3}, {'C', -1}, {'G', -3}, {'K', -2}, {'W', -3}, {'X', -1}, {'H', -3}, {'P', -2}}},
{'T', {{'V', 0}, {'N', 0}, {'F', -2}, {'R', -1}, {'M', -1}, {'X', 0}, {'I', -1}, {'A', 0}, {'P', -1}, {'E', -1}, {'Q', -1}, {'W', -2}, {'T', 5}, {'B', -1}, {'H', -2}, {'L', -1}, {'D', -1}, {'Y', -2}, {'K', -1}, {'C', -1}, {'G', -2}, {'Z', -1}, {'S', 1}}},
{'Q', {{'Q', 5}, {'A', -1}, {'Y', -1}, {'V', -2}, {'F', -3}, {'L', -2}, {'R', 1}, {'C', -3}, {'W', -2}, {'N', 0}, {'G', -2}, {'M', 0}, {'T', -1}, {'D', 0}, {'E', 2}, {'K', 1}, {'B', 0}, {'Z', 3}, {'S', 0}, {'H', 0}, {'P', -1}, {'I', -3}, {'X', -1}}},
{'N', {{'A', -2}, {'L', -3}, {'G', 0}, {'T', 0}, {'M', -2}, {'D', 1}, {'N', 6}, {'Q', 0}, {'R', 0}, {'Y', -2}, {'V', -3}, {'F', -3}, {'H', 1}, {'S', 1}, {'C', -3}, {'I', -3}, {'P', -2}, {'X', -1}, {'W', -4}, {'E', 0}, {'K', 0}, {'Z', 0}, {'B', 3}}},
{'A', {{'N', -2}, {'Q', -1}, {'W', -3}, {'Y', -2}, {'V', 0}, {'F', -2}, {'L', -1}, {'C', 0}, {'G', 0}, {'T', 0}, {'M', -1}, {'D', -2}, {'E', -1}, {'R', -1}, {'K', -1}, {'Z', -1}, {'B', -2}, {'S', 1}, {'H', -2}, {'A', 4}, {'P', -1}, {'I', -1}, {'X', 0}}},
{'Z', {{'Y', -2}, {'Z', 4}, {'P', -1}, {'G', -2}, {'C', -3}, {'K', 1}, {'W', -3}, {'I', -3}, {'V', -2}, {'D', 1}, {'L', -3}, {'H', 0}, {'S', 0}, {'E', 4}, {'A', -1}, {'M', -1}, {'Q', 3}, {'X', -1}, {'F', -3}, {'T', -1}, {'B', 1}, {'N', 0}, {'R', 0}}},
{'Y', {{'Z', -2}, {'M', -1}, {'I', -1}, {'E', -2}, {'B', -3}, {'A', -2}, {'Y', 7}, {'Q', -1}, {'N', -2}, {'F', 3}, {'R', -2}, {'V', -1}, {'K', -2}, {'G', -3}, {'C', -2}, {'W', 2}, {'S', -2}, {'L', -1}, {'H', 2}, {'D', -3}, {'T', -2}, {'P', -3}, {'X', -1}}},
{'D', {{'S', 0}, {'H', -1}, {'V', -3}, {'P', -1}, {'I', -3}, {'R', -2}, {'X', -1}, {'N', 1}, {'E', 2}, {'C', -3}, {'K', -1}, {'Z', 1}, {'Q', 0}, {'B', 4}, {'A', -2}, {'W', -4}, {'L', -4}, {'G', -1}, {'T', -1}, {'M', -3}, {'D', 6}, {'Y', -3}, {'F', -3}}},
{'H', {{'H', 8}, {'S', -1}, {'D', -1}, {'I', -3}, {'P', -2}, {'X', -1}, {'G', -2}, {'C', -3}, {'K', -1}, {'Z', 0}, {'B', 0}, {'R', 0}, {'W', -2}, {'N', 1}, {'L', -3}, {'T', -2}, {'M', -2}, {'Q', 0}, {'E', 0}, {'A', -2}, {'Y', 2}, {'V', -3}, {'F', -1}}},
{'M', {{'Y', -1}, {'R', -1}, {'V', 1}, {'N', -2}, {'T', -1}, {'F', 0}, {'W', -1}, {'Q', 0}, {'A', -1}, {'E', -2}, {'I', 1}, {'M', 5}, {'D', -3}, {'H', -2}, {'L', 2}, {'Z', -1}, {'B', -3}, {'S', -1}, {'C', -1}, {'G', -3}, {'P', -2}, {'K', -1}, {'X', -1}}},
{'G', {{'R', -2}, {'N', 0}, {'K', -2}, {'Z', -2}, {'B', -1}, {'S', 0}, {'H', -2}, {'Q', -2}, {'E', -2}, {'A', 0}, {'P', -2}, {'I', -4}, {'X', -1}, {'Y', -3}, {'D', -1}, {'V', -3}, {'F', -3}, {'W', -2}, {'L', -4}, {'G', 6}, {'C', -3}, {'T', -2}, {'M', -3}}},
{'I', {{'Y', -1}, {'V', 3}, {'H', -3}, {'D', -3}, {'F', 0}, {'Z', -3}, {'W', -3}, {'T', -1}, {'M', 1}, {'G', -4}, {'C', -1}, {'B', -3}, {'R', -3}, {'N', -3}, {'K', -3}, {'S', -2}, {'Q', -3}, {'I', 4}, {'E', -3}, {'A', -1}, {'P', -3}, {'L', 2}, {'X', -1}}},
{'E', {{'Y', -2}, {'C', -4}, {'V', -2}, {'F', -3}, {'W', -3}, {'L', -3}, {'G', -2}, {'D', 2}, {'T', -1}, {'M', -2}, {'Q', 2}, {'A', -1}, {'E', 5}, {'K', 1}, {'Z', 4}, {'B', 1}, {'S', 0}, {'H', 0}, {'R', 0}, {'N', 0}, {'P', -1}, {'I', -3}, {'X', -1}}},
{'B', {{'Y', -3}, {'S', 0}, {'W', -4}, {'K', 0}, {'G', -1}, {'C', -3}, {'P', -2}, {'L', -4}, {'H', 0}, {'D', 4}, {'I', -3}, {'T', -1}, {'V', -3}, {'Q', 0}, {'X', -1}, {'M', -3}, {'E', 1}, {'A', -2}, {'Z', 1}, {'R', -1}, {'N', 3}, {'F', -3}, {'B', 4}}},
{'C', {{'E', -4}, {'C', 9}, {'Z', -3}, {'B', -3}, {'Q', -3}, {'S', -1}, {'H', -3}, {'A', 0}, {'D', -3}, {'P', -3}, {'I', -1}, {'X', -2}, {'Y', -2}, {'N', -3}, {'V', -1}, {'K', -3}, {'F', -2}, {'W', -2}, {'L', -1}, {'G', -3}, {'T', -1}, {'R', -3}, {'M', -1}}},
{'K', {{'K', 5}, {'G', -2}, {'Z', 1}, {'B', 0}, {'S', 0}, {'R', 2}, {'H', -1}, {'L', -2}, {'P', -1}, {'D', -1}, {'X', -1}, {'Y', -2}, {'V', -2}, {'Q', 1}, {'I', -3}, {'A', -1}, {'E', 1}, {'C', -3}, {'F', -3}, {'W', -3}, {'T', -1}, {'N', 0}, {'M', -1}}},
{'X', {{'L', -1}, {'H', -1}, {'D', -1}, {'X', -1}, {'T', 0}, {'K', -1}, {'G', -1}, {'C', -2}, {'W', -2}, {'S', 0}, {'N', -1}, {'F', -1}, {'B', -1}, {'Z', -1}, {'V', -1}, {'R', -1}, {'P', -2}, {'M', -1}, {'I', -1}, {'E', -1}, {'A', 0}, {'Y', -1}, {'Q', -1}}}
};
AlignmentResult result = alignment(
"PRTEINS",
"PRTWPSEIN",
11, 1, blosum62
);
cout << result.score << endl;
cout << result.s1 << endl;
cout << result.s2 << endl;
return 0;
}

The figure below illustrates how affine gap penalties can be modeled in the alignment graph by introducing a new “long” edge for each gap.

Since we don’t know in advance where gaps should be located, we need to add edges accounting for every possible gap. Thus, affine gap penalties can be accommodated by adding all possible vertical and horizontal edges in the alignment graph to represent all possible gaps. Specifically, we add edges connecting (i,j)(i, j) to both (i+k,j)(i + k, j) and (i,j+k)(i, j + k) with weights σ+ε(k1)σ + ε · (k − 1) for all possible gap sizes k, as illustrated in the figure below. For two sequences of length n, the number of edges in the resulting alignment graph modeling affine gap penalties increases from O(n2)O(n^{2}) ...