Given a (k, d)-mer (a1_{1} . . . ak_{k} | b1_{1}, . . . bk_{k}), we define its prefix and suffix as the following (k −1, d + 1)-mers:

Prefix((a1_{1}ak_{k} |b1_{1},…bk_{k})) = (a1_{1}ak1_{k-1} |b1_{1}bk1_{k-1})

Suffix((a1_{1}ak_{k} |b1_{1},…bk_{k})) = (a2_{2}ak_{k} |b2_{2}b1_{1})

For example Prefix((GAC | TCA)) = (GA | TC) and Suffix((GAC | TCA)) = (AC | CA). Note that for consecutive (k, d)-mers appearing in Text, the suffix of the first (k, d)-mer is equal to the prefix of the second (k, d)-mer. For example for the consecutive (k, d)-mers (TAA | GCC) and (AAT | CCA) in TAATGCCATGGGATGTT.

Suffix((TAA | GCC)) = Prefix((AAT | CCA)) = (AA | CC)

Constructing a path graph

Given a string Text, we construct a graph PathGraphk,d_{k, d} (Text) that represents a path formed by |Text| − (k + d + k) + 1 edges corresponding to all (k, d)-mers in Text. We label edges in this path by (k, d)-mers and label the starting and ending nodes of an edge by its prefix and suffix, respectively (see figure given below).

Get hands-on with 1400+ tech skills courses.