Introduction to Sequence Alignment

Explore sequence alignment by treating it as a game.

Sequence alignment as a game

To simplify matters, we’ll compare only two sequences at a time, returning to multiple sequence comparison at the end of the chapter. The Hamming distance, which counts the mismatches between two strings of equal length, rigidly assumes that we align the i-th symbol of one sequence against the i-th symbol of the other. However, because biological sequences are subject to insertions and deletions, the i-th symbol of one sequence often corresponds to a symbol at a completely different position in the other sequence. The goal, then, is to find the most appropriate correspondence of symbols.

For example, ATGCATGC and TGCATGCA have no matching positions, so their Hamming distance is equal to 8:

ATGCATGC
TGCATGCA
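
To check the count above, here is a minimal Python sketch of the Hamming distance. The function name `hamming_distance` is our own choice for illustration, not something defined in the lesson.

```python
def hamming_distance(s: str, t: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance requires strings of equal length")
    # Compare the i-th symbol of s against the i-th symbol of t.
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("ATGCATGC", "TGCATGCA"))  # 8
```

Because the comparison is strictly position by position, a shift of a single symbol is enough to make every position mismatch.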

Yet these strings have seven matching positions if we align them differently:

ATGCATGC-
-TGCATGCA

Strings ATGCTTA and TGCATTAA have more subtle similarities:

ATGC-TTA-
-TGCATTAA
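
The gapped alignments above can be checked with a short sketch that counts the columns in which both rows carry the same symbol. The name `count_matches` and the use of `-` as the gap character are assumptions made for this illustration.

```python
def count_matches(row1: str, row2: str, gap: str = "-") -> int:
    """Count columns of a gapped alignment where both rows hold the same symbol."""
    if len(row1) != len(row2):
        raise ValueError("both rows of an alignment must have the same length")
    # A column counts as a match only if the symbols agree and neither is a gap.
    return sum(a == b and a != gap for a, b in zip(row1, row2))


print(count_matches("ATGCATGC-", "-TGCATGCA"))  # 7
print(count_matches("ATGC-TTA-", "-TGCATTAA"))  # 6
```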

These examples lead us to postulate a notion of a good alignment as one that matches as many symbols as possible. You can think about maximizing the number of matched symbols in two strings as a single-person game (figure below). At each turn, you have two choices. You can remove the first symbol from both sequences, in which case you earn a point if the two symbols match; alternatively, you can remove the first symbol from just one of the two sequences, in which case you earn no points but may set yourself up to earn more points in later moves. Your goal is to maximize the number of points.
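
The rules of the game can be sketched as a small scoring function. This is only an illustration under our own assumptions: the move names `"both"`, `"first"`, and `"second"` and the function name `play_alignment_game` are hypothetical, and we assume the game ends once every symbol has been removed from both sequences.

```python
def play_alignment_game(s: str, t: str, moves: list[str]) -> int:
    """Score one play-through of the alignment game.

    Each move is one of (names are hypothetical):
      "both"   - remove the first symbol of both strings (1 point if they match)
      "first"  - remove the first symbol of s only (no points)
      "second" - remove the first symbol of t only (no points)
    """
    i = j = 0  # index of the current first symbol in s and in t
    score = 0
    for move in moves:
        if move == "both":
            if i >= len(s) or j >= len(t):
                raise ValueError("cannot remove from an empty sequence")
            if s[i] == t[j]:
                score += 1
            i += 1
            j += 1
        elif move == "first":
            if i >= len(s):
                raise ValueError("first sequence is already empty")
            i += 1
        elif move == "second":
            if j >= len(t):
                raise ValueError("second sequence is already empty")
            j += 1
        else:
            raise ValueError(f"unknown move: {move!r}")
    # Assumption: a finished game has consumed both sequences entirely.
    if i != len(s) or j != len(t):
        raise ValueError("the game is not finished: symbols remain")
    return score


# Always removing from both sequences reproduces the position-by-position comparison.
print(play_alignment_game("ATGCATGC", "TGCATGCA", ["both"] * 8))  # 0

# Skipping one symbol first corresponds to the shifted alignment shown earlier.
print(play_alignment_game("ATGCATGC", "TGCATGCA",
                          ["first"] + ["both"] * 7 + ["second"]))  # 7
```

Each "first" or "second" move corresponds to a gap symbol in the other row of the alignment, which is why a zero-point move can pay off later.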

STOP and Think: The figure below shows just one of many possible ways to play the alignment game for the strings ATGCATGC and TGCATGCA. Can you find an even better way to play the game for these strings?
