String Reconstruction with Gluing Nodes and De Bruijn Graphs
Let’s learn how we can reconstruct strings by gluing labeled nodes.
Gluing nodes and de Bruijn graphs
Let’s again represent the genome TAATGCCATGGGATGTT as a sequence of its 3-mers:
This time, instead of assigning these 3-mers to nodes, we’ll assign them to edges, as shown in the figure below. You can once again reconstruct the genome by following this path from left to right, adding one new nucleotide at each step. Since each pair of consecutive edges represent consecutive 3-mers that overlap in two nucleotides, we’ll label each node of this graph with a 2-mer representing the overlapping nucleotides shared by the edges on either side of the node. For example, the node with incoming edge CAT and outgoing edge ATG is labeled AT.
Gluing labeled nodes
Nothing seems new here until we start gluing identically labeled nodes. In the figure given below (top panels), we bring the three AT nodes closer and closer to each other until they’ve been glued into a single node. Note that there are also three nodes labeled by TG, which we glue together in the figure below (middle panels). Finally, we glue together the two nodes labeled GG (GG and GG), as shown in the figure below(bottom panels), which produces a special type of edge called a loop connecting GG to itself.
The number of nodes in the resulting graph, shown in the figure above (bottom right), has reduced from 16 to 11, while the number of edges stayed the same. This graph is called the de Bruijn graph of TAATGCCATGGGATGTT, denoted DeBruijn ...