Feature #3: Find Dictionary
Implementing the "Find Dictionary" feature for our "Cyber Security" project.
Description
We use a network protocol that encrypts all application messages using a proprietary scheme. The encryption scheme has a unique property that the sequence of encrypted messages in a session appears to be in sorted order according to a secret dictionary. However, the dictionary is not transmitted for security purposes.
Before the sender starts transmitting actual messages, it sends several encrypted training messages to the receiver. The sender guarantees that the training messages will follow the lexicographic order according to the unknown dictionary.
The receiver must reverse engineer the training messages and generate the dictionary for future communication with the sender. If the order of the messages is invalid, the receiver generates an empty dictionary and asks the sender to retransmit the training messages.
For simplicity’s sake, we can assume that the encrypted contents of the messages only consist of English lowercase letters.
Let’s review a few examples below:
Solution
We can essentially map this problem to a graph problem but, before exploring the exact details of the solution, there are a few things that we need to keep in mind:
-
The letters within a message don’t tell us anything about the relative order. For example, the message
educative
in the list does not tell us that the lettere
is before the letterd
. -
The input can contain messages followed by their prefix, for example,
educated
and theneducate
. These cases will never result in a valid alphabet (because in a valid alphabet, prefixes are always first). We’ll need to make sure our solution detects these cases correctly. -
There can be more than one valid alphabet ordering. It is fine for our algorithm to return any one of them.
-
The output dictionary must contain all unique letters within the messages’ list, including those that could be in any position within the ordering. It should not contain any additional letters that were not in the input.
Now back to the graph problem part, we can break this particular problem into three parts:
-
Extract the necessary information to identify the dependency rules from the messages. For example, in the messages provided in the slides above,
["decode", "interview"]
, the letterd
comes beforei
. -
With the gathered information, we can put these dependency rules into a directed graph with the letters as nodes and the dependencies (order) as the edges.
-
Lastly, we can sort the graph nodes topologically to generate the letter ordering (dictionary).
Let’s look at each part in more depth.
Part-1: Identifying the dependencies
Let’s start with an example of encrypted training messages and observe the initial ordering through simple reasoning:
["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
As in the English language dictionary, where all the words starting with a
come at the start followed by the words starting with b
, c
, d
, and so on, we can expect the first letters of each message to be in alphabetical order as well.
["m", "m", "x", "x", "x", "x", "s", "s", "s", "s", "r", "r", "r"]
Removing the duplicates, we get:
["m", "x", "s", "r"]
The following illustration might clarify this step:
Looking at the letters above, we know the relative order of these letters but do not know how these letters fit in with the rest of the letters. To get more information, we’ll need to look further into our English dictionary analogy. The word dirt
comes before dorm
. This is because we look at the second letter when the first letter is the same. In this case, i
comes before o
in the alphabet.
We can apply the same logic to our encrypted messages and look at the first two messages, mzsor
and mqov
. As the first letter is the same in both the messages, we look at the second letter. The first message has z
, and the other second one has q
. Therefore, we can safely say that z
comes before q
. We now have two fragments of the letter-order:
["m", "x", "s", "r"], ["z", "q"]
We don’t know yet how these two fragments could fit together into a single ordering. For example, we don’t know whether m
is before q
, or q
is before m
, or even whether or not there’s enough information available in the input for us to know.
Anyway, we’ve now gotten all the information we can out of the first two words. All letters after z
in mzosr
, and after q
in mqov
, can be ignored because they do not impact the relative ordering of the two words. To better understand this, we can think back to dirt
and dorm
. Because i
> o
, the rt
and rm
parts are unimportant for determining alphabetical ordering.
Hopefully, we can see a pattern here. When two messages are adjacent, we need to look for the first difference between them. That difference tells us the relative order between two letters. Let’s have a look at all the relations we can extract by comparing adjacent messages:
Notice that we did not mention rules such as
m -> a
. This is fine because we can derive this relation fromm -> x
,x -> a
.
This is it for the first part. Let’s put the pieces that we have in place.
Part-2: Representing the dependencies
We now have a set of relations mentioning the relative order of the pairs of letters:
["z -> q", "m -> x", "x -> a", "x -> v", "x -> s", "z -> x", "v -> a", "s -> r", "o -> w"]
Now the question arises, how can we put these relations together? One might be tempted to start chaining all these together. Let’s look at a few possible chains:
As we can observe from our chains above that some letters might appear in more than one chain and putting the chains into the output list one after the other won’t work. Some of the letters might be duplicated and would result in an invalid ordering.
Let’s try to visualize the relations better with the help of a graph. The nodes are the letters, and an edge between two letters, x
and y
represents that x
is before y
in the encrypted messages.
Part-3: Generating the dictionary
As we can see from the graph, four of the letters have no incoming arrows. What this means is that there are no letters that have to come before any of these four.
Remember that there could be multiple valid dictionaries, and if there are, then it’s fine for us to return any of them.
Therefore, a valid start to the ordering we return would be:
["o", "m", "u", "z"]
We can now remove these letters and edges from the graph because any other letters that required them first will now have this requirement satisfied.
On this new graph, there are now three new letters that have no in-arrows. We can add these to our output list.
["o", "m", "u", "z", "x", "q", "w"]
Again, we can remove these from the graph.
Then add the two new letters with no in-arrows.
["o", "m", "u", "z", "x", "q", "w", "v", "s"]
This leaves the following graph:
We can place the final two letters in our output list and return the ordering:
["o", "m", "u", "z", "x", "q", "w", "v", "s", "a", "r"]
Let’s now review how we can implement this approach next.
Algorithm
Identifying the dependencies and representing them in the form of a graph is pretty straightforward. We extract the relations and insert them into an adjacency list:
Next, we need to generate the dictionary from the extracted relations: identify the letters (nodes) with no incoming links. Identifying whether a particular letter (node) has any incoming links or not from our adjacency list format can be a little complicated. A naive approach would be to repeatedly iterate over the adjacency lists of all the other nodes and check whether or not they contain a link to that particular node.
This naive method would be fine for our case, but perhaps we can do it more optimally.
An alternative is to keep two adjacency lists:
- One with the same contents as the one above
- One reversed that shows the incoming links
This way, every time we traverse an edge, we can remove the corresponding edge from the reversed adjacency list:
What if we told you that we could do better than this? Instead of tracking the incoming links for all the letters from a particular letter, we can track the count of how many incoming edges there are. We can keep the indegree count of all the letters along with the forward adjacency list.
Indegree corresponds to the number of incoming edges of a node.
Let’s see how that might look like below:
Now, we can decrement the indegree count of a node instead of removing it from the reverse adjacency list. When the indegree of the node reaches 0
, this represents that this particular node has no incoming links left.
We perform BFS on all the letters that are reachable, i.e., the indegree count of the letters is zero. A letter is only reachable once the letters that need to be before it have been added to the output, result
.
We use a queue
to keep track of reachable nodes and perform BFS on them. Initially, we put the letters that have zero indegree count. We keep adding the letters to the queue as their indegree counts become zero.
We continue this until the queue is empty. Next, we check whether all the letters in the messages have been added to the output or not. This would only happen when some letters still have some incoming edges left, which means there is a cycle. In this case, we return ""
.
Remember that there can be letters that do not have any incoming edges. This can result in different orderings for the same set of messages, and that’s alright.
Let’s try to visualize the algorithm with the help of a set of slides below:
Let’s review the actual implementation code below:
defmodule Solution doalias Okasaki.Deque, as: Dequedef update_data_structures(adj_list, counts, message) docase String.next_grapheme(message) donil -> {adj_list, counts}{c, rest} ->counts = Map.put(counts, c, 0)adj_list = Map.put(adj_list, c, MapSet.new())update_data_structures(adj_list, counts, rest)endenddef find_edges(adj_list, counts, []), do: {:ok, adj_list, counts}def find_edges(adj_list, counts, [_message1]), do: {:ok, adj_list, counts}def find_edges(adj_list, counts, [message1, message2 | rest]) docond dobyte_size(message1) > byte_size(message2) and String.contains?(message1, message2) -> {:empty, adj_list, counts}true -># Find the first non matching letter and insert the corresponding relation.{adj_list, counts, _, _} =0..min(byte_size(message1), byte_size(message2))-1|> Enum.reduce_while({adj_list, counts, message1, message2},fn(_, {adj_list, counts, message1, message2}) -><<first_m1::utf8, rest_m1::binary>> = message1<<first_m2::utf8, rest_m2::binary>> = message2cond do!String.equivalent?(<<first_m1::utf8>>, <<first_m2::utf8>>) ->cond do!MapSet.member?(adj_list[<<first_m1::utf8>>], <<first_m2::utf8>>) ->counts = Map.put(counts, <<first_m2::utf8>>, counts[<<first_m2::utf8>>]+1)new_set = MapSet.new([<<first_m2::utf8>>])adj_list = Map.update(adj_list, <<first_m1::utf8>>, new_set, fn curr_set ->MapSet.union(curr_set, new_set)end){:halt, {adj_list, counts, <<rest_m1::binary>>, <<rest_m2::binary>>}}true -> {:halt, {adj_list, counts, <<rest_m1::binary>>, <<rest_m2::binary>>}}endtrue -> {:cont, {adj_list, counts, <<rest_m1::binary>>, <<rest_m2::binary>>}}endend)find_edges(adj_list, counts, [message2 | rest])endenddef queue_check(adj_list, counts, queue, result) docond doDeque.size(queue) > 0 ->{:ok, {c, queue}} = Deque.remove_left(queue)result = [c | result]{counts, queue} =adj_list[c]|> Enum.reduce({counts, queue}, fn(next, {counts, queue}) ->counts = Map.put(counts, next, counts[next]-1)queue = if counts[next] == 0, do: Deque.insert_right(queue, next), else: queue{counts, queue}end)queue_check(adj_list, counts, queue, result)true -> {counts, result}endenddef find_dictionary(messages) do# Step 0: Create data structures and find all unique letters.adj_list = %{}counts = %{}{adj_list, counts} =messages |> Enum.reduce({adj_list, counts}, fn(message, {adj_list, counts}) ->update_data_structures(adj_list, counts, message)end)# Step 1: Find all edges.{check, adj_list, counts} = find_edges(adj_list, counts, messages)cond docheck == :empty -> ""true -># Step 2: Breadth-first search.result = []queue = Deque.new()queue =counts|> Enum.reduce(queue, fn({k, v}, queue) ->if v == 0, do: Deque.insert_right(queue, k), else: queueend){counts, result} = queue_check(adj_list, counts, queue, result)# # If not all letters are in result, that means there was a cycle and so# # no valid ordering. Return "".# # Otherwise, convert the ordering we found into a string and return it.if length(result) < map_size(counts), do: "", else: :lists.reverse(result) |> Enum.join("")endendend# Driver CodeIO.puts "-----------------------------"IO.puts "PROGRAM OUTPUT:"# Example - 1messages = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]IO.puts "Dictionary = #{Solution.find_dictionary(messages)}"# Example - 2messages = ["vanilla", "alpine", "algor", "port", "norm", "nylon", "ophellia", "hidden"]IO.puts "Dictionary = #{Solution.find_dictionary(messages)}"
Complexity measures
Time Complexity | Space Complexity |
---|---|
Let be the total number of messages in the input list.
Let be the total length of all the messages in the input list, added together.
Let be the total number of unique letters in the messages. While this is limited to in our case, we’ll still look at how it would impact the complexity if this was not the case.
Time complexity
There are three parts to the algorithm:
- identifying all the relations
- putting them into an adjacency list
- converting it into a valid alphabet ordering
In the worst case, the identification and ...