Feature #3: Find Dictionary

Implementing the "Find Dictionary" feature for our "Cyber Security" project.

Description

We use a network protocol that encrypts all application messages using a proprietary scheme. The encryption scheme has a unique property that the sequence of encrypted messages in a session appears to be in sorted order according to a secret dictionary. However, the dictionary is not transmitted for security purposes.

Before the sender starts transmitting actual messages, it sends several encrypted training messages to the receiver. The sender guarantees that the training messages will follow the lexicographic order according to the unknown dictionary.

The receiver must reverse engineer the training messages and generate the dictionary for future communication with the sender. If the order of the messages is invalid, the receiver generates an empty dictionary and asks the sender to retransmit the training messages.

For simplicity’s sake, we can assume that the encrypted contents of the messages only consist of English lowercase letters.

Let’s review a few examples below:

Solution

We can essentially map this problem to a graph problem but, before exploring the exact details of the solution, there are a few things that we need to keep in mind:

  • The letters within a message don’t tell us anything about the relative order. For example, the message educative in the list does not tell us that the letter e is before the letter d.

  • The input can contain messages followed by their prefix, for example, educated and then educate. These cases will never result in a valid alphabet (because in a valid alphabet, prefixes are always first). We’ll need to make sure our solution detects these cases correctly.

  • There can be more than one valid alphabet ordering. It is fine for our algorithm to return any one of them.

  • The output dictionary must contain all unique letters within the messages’ list, including those that could be in any position within the ordering. It should not contain any additional letters that were not in the input.

Now back to the graph problem part, we can break this particular problem into three parts:

  1. Extract the necessary information to identify the dependency rules from the messages. For example, in the messages provided in the slides above, ["decode", "interview"], the letter d comes before i.

  2. With the gathered information, we can put these dependency rules into a directed graph with the letters as nodes and the dependencies (order) as the edges.

  3. Lastly, we can sort the graph nodes topologically to generate the letter ordering (dictionary).

Let’s look at each part in more depth.

Part-1: Identifying the dependencies

Let’s start with an example of encrypted training messages and observe the initial ordering through simple reasoning:

Press + to interact
["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]

As in the English language dictionary, where all the words starting with a come at the start followed by the words starting with b, c, d, and so on, we can expect the first letters of each message to be in alphabetical order as well.

Press + to interact
["m", "m", "x", "x", "x", "x", "s", "s", "s", "s", "r", "r", "r"]

Removing the duplicates, we get:

Press + to interact
["m", "x", "s", "r"]

The following illustration might clarify this step:

Looking at the letters above, we know the relative order of these letters but do not know how these letters fit in with the rest of the letters. To get more information, we’ll need to look further into our English dictionary analogy. The word dirt comes before dorm. This is because we look at the second letter when the first letter is the same. In this case, i comes before o in the alphabet.

We can apply the same logic to our encrypted messages and look at the first two messages, mzsor and mqov. As the first letter is the same in both the messages, we look at the second letter. The first message has z, and the other second one has q. Therefore, we can safely say that z comes before q. We now have two fragments of the letter-order:

Press + to interact
["m", "x", "s", "r"], ["z", "q"]

We don’t know yet how these two fragments could fit together into a single ordering. For example, we don’t know whether m is before q, or q is before m, or even whether or not there’s enough information available in the input for us to know.

Anyway, we’ve now gotten all the information we can out of the first two words. All letters after z in mzosr, and after q in mqov, can be ignored because they do not impact the relative ordering of the two words. To better understand this, we can think back to dirt and dorm. Because i > o, the rt and rm parts are unimportant for determining alphabetical ordering.

Hopefully, we can see a pattern here. When two messages are adjacent, we need to look for the first difference between them. That difference tells us the relative order between two letters. Let’s have a look at all the relations we can extract by comparing adjacent messages:

Notice that we did not mention rules such as m -> a. This is fine because we can derive this relation from m -> x, x -> a.

This is it for the first part. Let’s put the pieces that we have in place.

Part-2: Representing the dependencies

We now have a set of relations mentioning the relative order of the pairs of letters:

Press + to interact
["z -> q", "m -> x", "x -> a", "x -> v", "x -> s", "z -> x", "v -> a", "s -> r", "o -> w"]

Now the question arises, how can we put these relations together? One might be tempted to start chaining all these together. Let’s look at a few possible chains:

As we can observe from our chains above that some letters might appear in more than one chain and putting the chains into the output list one after the other won’t work. Some of the letters might be duplicated and would result in an invalid ordering.

Let’s try to visualize the relations better with the help of a graph. The nodes are the letters, and an edge between two letters, x and y represents that x is before y in the encrypted messages.

g x x v v x->v a a x->a s s x->s u u q q z z z->x z->q m m m->x o o w w o->w v->a r r s->r
Graph representation

Part-3: Generating the dictionary

As we can see from the graph, four of the letters have no incoming arrows. What this means is that there are no letters that have to come before any of these four.

Remember that there could be multiple valid dictionaries, and if there are, then it’s fine for us to return any of them.

Therefore, a valid start to the ordering we return would be:

["o", "m", "u", "z"]

We can now remove these letters and edges from the graph because any other letters that required them first will now have this requirement satisfied.

g x x v v x->v a a x->a s s x->s q q w w v->a r r s->r
Remove the first order letters

On this new graph, there are now three new letters that have no in-arrows. We can add these to our output list.

["o", "m", "u", "z", "x", "q", "w"]

Again, we can remove these from the graph.

g v v a a v->a s s r r s->r
Remove the second order letters

Then add the two new letters with no in-arrows.

["o", "m", "u", "z", "x", "q", "w", "v", "s"]

This leaves the following graph:

g a a r r
Remove the remaining letters

We can place the final two letters in our output list and return the ordering:

["o", "m", "u", "z", "x", "q", "w", "v", "s", "a", "r"]

Let’s now review how we can implement this approach next.

Algorithm

Identifying the dependencies and representing them in the form of a graph is pretty straightforward. We extract the relations and insert them into an adjacency list:

Next, we need to generate the dictionary from the extracted relations: identify the letters (nodes) with no incoming links. Identifying whether a particular letter (node) has any incoming links or not from our adjacency list format can be a little complicated. A naive approach would be to repeatedly iterate over the adjacency lists of all the other nodes and check whether or not they contain a link to that particular node.

This naive method would be fine for our case, but perhaps we can do it more optimally.

An alternative is to keep two adjacency lists:

  • One with the same contents as the one above
  • One reversed that shows the incoming links

This way, every time we traverse an edge, we can remove the corresponding edge from the reversed adjacency list:

What if we told you that we could do better than this? Instead of tracking the incoming links for all the letters from a particular letter, we can track the count of how many incoming edges there are. We can keep the indegree count of all the letters along with the forward adjacency list.

Indegree corresponds to the number of incoming edges of a node.

Let’s see how that might look like below:

Now, we can decrement the indegree count of a node instead of removing it from the reverse adjacency list. When the indegree of the node reaches 0, this represents that this particular node has no incoming links left.

We perform BFS on all the letters that are reachable, i.e., the indegree count of the letters is zero. A letter is only reachable once the letters that need to be before it have been added to the output, result.

We use a queue to keep track of reachable nodes and perform BFS on them. Initially, we put the letters that have zero indegree count. We keep adding the letters to the queue as their indegree counts become zero.

We continue this until the queue is empty. Next, we check whether all the letters in the messages have been added to the output or not. This would only happen when some letters still have some incoming edges left, which means there is a cycle. In this case, we return "".

Remember that there can be letters that do not have any incoming edges. This can result in different orderings for the same set of messages, and that’s alright.

Let’s try to visualize the algorithm with the help of a set of slides below:

Let’s review the actual implementation code below:

Press + to interact
defmodule Solution do
alias Okasaki.Deque, as: Deque
def update_data_structures(adj_list, counts, message) do
case String.next_grapheme(message) do
nil -> {adj_list, counts}
{c, rest} ->
counts = Map.put(counts, c, 0)
adj_list = Map.put(adj_list, c, MapSet.new())
update_data_structures(adj_list, counts, rest)
end
end
def find_edges(adj_list, counts, []), do: {:ok, adj_list, counts}
def find_edges(adj_list, counts, [_message1]), do: {:ok, adj_list, counts}
def find_edges(adj_list, counts, [message1, message2 | rest]) do
cond do
byte_size(message1) > byte_size(message2) and String.contains?(message1, message2) -> {:empty, adj_list, counts}
true ->
# Find the first non matching letter and insert the corresponding relation.
{adj_list, counts, _, _} =
0..min(byte_size(message1), byte_size(message2))-1
|> Enum.reduce_while({adj_list, counts, message1, message2},
fn(_, {adj_list, counts, message1, message2}) ->
<<first_m1::utf8, rest_m1::binary>> = message1
<<first_m2::utf8, rest_m2::binary>> = message2
cond do
!String.equivalent?(<<first_m1::utf8>>, <<first_m2::utf8>>) ->
cond do
!MapSet.member?(adj_list[<<first_m1::utf8>>], <<first_m2::utf8>>) ->
counts = Map.put(counts, <<first_m2::utf8>>, counts[<<first_m2::utf8>>]+1)
new_set = MapSet.new([<<first_m2::utf8>>])
adj_list = Map.update(adj_list, <<first_m1::utf8>>, new_set, fn curr_set ->
MapSet.union(curr_set, new_set)
end)
{:halt, {adj_list, counts, <<rest_m1::binary>>, <<rest_m2::binary>>}}
true -> {:halt, {adj_list, counts, <<rest_m1::binary>>, <<rest_m2::binary>>}}
end
true -> {:cont, {adj_list, counts, <<rest_m1::binary>>, <<rest_m2::binary>>}}
end
end)
find_edges(adj_list, counts, [message2 | rest])
end
end
def queue_check(adj_list, counts, queue, result) do
cond do
Deque.size(queue) > 0 ->
{:ok, {c, queue}} = Deque.remove_left(queue)
result = [c | result]
{counts, queue} =
adj_list[c]
|> Enum.reduce({counts, queue}, fn(next, {counts, queue}) ->
counts = Map.put(counts, next, counts[next]-1)
queue = if counts[next] == 0, do: Deque.insert_right(queue, next), else: queue
{counts, queue}
end)
queue_check(adj_list, counts, queue, result)
true -> {counts, result}
end
end
def find_dictionary(messages) do
# Step 0: Create data structures and find all unique letters.
adj_list = %{}
counts = %{}
{adj_list, counts} =
messages |> Enum.reduce({adj_list, counts}, fn(message, {adj_list, counts}) ->
update_data_structures(adj_list, counts, message)
end)
# Step 1: Find all edges.
{check, adj_list, counts} = find_edges(adj_list, counts, messages)
cond do
check == :empty -> ""
true ->
# Step 2: Breadth-first search.
result = []
queue = Deque.new()
queue =
counts
|> Enum.reduce(queue, fn({k, v}, queue) ->
if v == 0, do: Deque.insert_right(queue, k), else: queue
end)
{counts, result} = queue_check(adj_list, counts, queue, result)
# # If not all letters are in result, that means there was a cycle and so
# # no valid ordering. Return "".
# # Otherwise, convert the ordering we found into a string and return it.
if length(result) < map_size(counts), do: "", else: :lists.reverse(result) |> Enum.join("")
end
end
end
# Driver Code
IO.puts "-----------------------------"
IO.puts "PROGRAM OUTPUT:"
# Example - 1
messages = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
IO.puts "Dictionary = #{Solution.find_dictionary(messages)}"
# Example - 2
messages = ["vanilla", "alpine", "algor", "port", "norm", "nylon", "ophellia", "hidden"]
IO.puts "Dictionary = #{Solution.find_dictionary(messages)}"

Complexity measures

Time Complexity Space Complexity
O(c)O(c) O(1)O(1)

Let nn be the total number of messages in the input list.

Let cc be the total length of all the messages in the input list, added together.

Let uu be the total number of unique letters in the messages. While this is limited to 2626 in our case, we’ll still look at how it would impact the complexity if this was not the case.

Time complexity

There are three parts to the algorithm:

  • identifying all the relations
  • putting them into an adjacency list
  • converting it into a valid alphabet ordering

In the worst case, the identification and ...