Decode the Coding Interview in Python: Real-World Examples

Feature #3: Find Dictionary

Description

We use a network protocol that encrypts all application messages using a proprietary scheme. The encryption scheme has a unique property that the sequence of encrypted messages in a session appears to be in sorted order according to a secret dictionary. However, the dictionary is not transmitted for security purposes.

Before the sender starts transmitting actual messages, it sends several encrypted training messages to the receiver. The sender guarantees that the training messages will follow the lexicographic order according to the unknown dictionary.

The receiver must reverse engineer the training messages and generate the dictionary for future communication with the sender. If the order of the messages is invalid, the receiver generates an empty dictionary and asks the sender to retransmit the training messages.

For simplicity’s sake, we can assume that the encrypted contents of the messages consist only of lowercase English letters.

Solution

We can map this problem to a graph problem. Before exploring the details of the solution, there are a few things that we need to keep in mind:

The letters within a message don’t tell us anything about the relative order. For example, the message educative in the list does not tell us that the letter e is before the letter d.
The input can contain a message followed by its own prefix, for example, educated and then educate. Such an input can never correspond to a valid alphabet, because in lexicographic order a prefix always comes before any longer message that starts with it. We’ll need to make sure our solution detects these cases correctly.
There can be more than one valid alphabet ordering. It is fine for our algorithm to return any one of them.
The output dictionary must contain every unique letter that appears in the list of messages, including letters that could go in any position within the ordering. It should not contain any letters that were not in the input.
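The prefix case from the list above can be checked in isolation. The following is a minimal sketch; the helper name `is_invalid_prefix_pair` is hypothetical and is not part of the solution presented later.

```python
def is_invalid_prefix_pair(first, second):
    """Return True if `second` is a proper prefix of `first`.

    Such a pair can never be in sorted order, whatever the alphabet is.
    The helper name is hypothetical and used only for illustration.
    """
    return len(second) < len(first) and first.startswith(second)


print(is_invalid_prefix_pair("educated", "educate"))  # True: invalid order
print(is_invalid_prefix_pair("educate", "educated"))  # False: valid order
```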

Returning to the graph formulation, we can break this problem into three parts:

Extract the necessary information to identify the dependency rules from the messages. For example, from the messages ["decode", "interview"], we learn that the letter d comes before i.
With the gathered information, we can put these dependency rules into a directed graph with the letters as nodes and the dependencies (order) as the edges.
Lastly, we can sort the graph nodes topologically to generate the letter ordering (dictionary).

Let’s look at each part in more depth.

Part-1: Identifying the dependencies

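Let’s start with an example of encrypted training messages and observe the initial ordering through simple reasoning:

["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]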

Looking at the first letters of the messages (m, x, s, and r, in that order), we know the relative order of these four letters, but we do not know how they fit in with the rest of the letters. To get more information, we’ll need to look further into our English dictionary analogy. The word dirt comes before dorm. This is because we look at the second letter when the first letter is the same. In this case, i comes before o in the alphabet.

We can apply the same logic to our encrypted messages and look at the first two messages, mzosr and mqov. As the first letter is the same in both messages, we look at the second letter. The first message has z, and the second one has q. Therefore, we can safely say that z comes before q. We now have two fragments of the letter order:

m → x → s → r
z → q

We don’t know yet how these two fragments could fit together into a single ordering. For example, we don’t know whether m is before q, or q is before m, or even whether or not there’s enough information available in the input for us to know.

We’ve now extracted all the information we can from the first two messages. All letters after z in mzosr, and after q in mqov, can be ignored because they do not impact the relative ordering of the two messages. To better understand this, we can think back to dirt and dorm. Because i comes before o, the rt and rm parts are unimportant for determining alphabetical ordering.

Hopefully, we can see a pattern here. When two messages are adjacent, we need to look for the first difference between them. That difference tells us the relative order between two letters. Let’s have a look at all the relations we can extract by comparing adjacent messages:
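The pattern described above can be sketched as a small helper that compares each pair of adjacent messages and records the first differing pair of letters. The function name `extract_relations` and the list-of-pairs representation are assumptions made for illustration; the prefix check is deliberately left out here.

```python
def extract_relations(messages):
    """Return the distinct (before, after) letter pairs implied by
    each pair of adjacent messages, in the order they are found."""
    relations = []
    for first, second in zip(messages, messages[1:]):
        for c, d in zip(first, second):
            if c != d:
                # Only the first differing position carries information.
                if (c, d) not in relations:
                    relations.append((c, d))
                break
    return relations


messages = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu",
            "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
for before, after in extract_relations(messages):
    print(before, "->", after)
```

For the example messages, this yields nine distinct relations: z → q, m → x, x → a, v → a, x → s, z → x, s → r, o → w, and x → v. The letter u never appears at a first difference, so it has no relation with any other letter.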

Part-3: Generating the dictionary

In the graph built from these relations, four of the letters have no incoming arrows. This means that no letters have to come before any of these four.

Remember that there could be multiple valid dictionaries, and if there are, then it’s fine for us to return any of them.

Therefore, a valid start to the ordering we return would be:

["o", "m", "u", "z"]

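We can now remove these letters and their outgoing edges from the graph, because any other letters that required them first now have this requirement satisfied.

Once they are removed, x, q, and w have no incoming arrows left, so they can be added next. Removing those in turn frees v and s, and after that only a and r remain.

We can place these final two letters in our output list and return the ordering: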

["o", "m", "u", "z", "x", "q", "w", "v", "s", "a", "r"]
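The removal process described above can be sketched as peeling the graph layer by layer. This is only an illustration: the function name `peel_layers` is hypothetical, the adjacency data is written out by hand from the example messages, and each layer is sorted alphabetically purely to make the output reproducible.

```python
def peel_layers(adj_list, letters):
    """Repeatedly remove all letters with no incoming edges.

    Returns a list of layers; letters in the same layer can appear
    in any relative order in the final dictionary.
    """
    indegree = {c: 0 for c in letters}
    for targets in adj_list.values():
        for d in targets:
            indegree[d] += 1

    layers = []
    current = sorted(c for c in letters if indegree[c] == 0)
    while current:
        layers.append(current)
        next_layer = []
        for c in current:
            for d in adj_list.get(c, ()):
                indegree[d] -= 1
                if indegree[d] == 0:
                    next_layer.append(d)
        current = sorted(next_layer)
    return layers


# Edges derived by hand from the example messages (before -> after).
adj_list = {"z": {"q", "x"}, "m": {"x"}, "x": {"a", "s", "v"},
            "v": {"a"}, "s": {"r"}, "o": {"w"}}
letters = "mzosrqvxauw"
for layer in peel_layers(adj_list, letters):
    print(layer)
```

Concatenating the layers in order gives one valid dictionary. The ordering shown above is another, since letters within a layer can be arranged freely.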

Let’s now review how we can implement this approach.

Algorithm

Identifying the dependencies and representing them in the form of a graph is pretty straightforward. We extract the relations and insert them into an adjacency list:
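As a minimal sketch of this step, the relations can be stored in a `defaultdict` of sets, which is the same structure the full solution below uses. The `relations` list here is written out by hand from the example messages and is an assumption of this illustration.

```python
from collections import defaultdict

# Relations from the example messages, as (before, after) pairs.
relations = [("z", "q"), ("m", "x"), ("x", "a"), ("v", "a"), ("x", "s"),
             ("z", "x"), ("s", "r"), ("o", "w"), ("x", "v")]

# Map each letter to the set of letters that must come after it.
adj_list = defaultdict(set)
for before, after in relations:
    adj_list[before].add(after)

for letter in sorted(adj_list):
    print(letter, "->", sorted(adj_list[letter]))
```

Using a set for each entry means that a relation seen more than once is stored only once.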

Next, we need to generate the dictionary from the extracted relations. The first step is to identify the letters (nodes) with no incoming links. Determining whether a particular letter (node) has any incoming links from our adjacency list format can be a little complicated. A naive approach would be to repeatedly iterate over the adjacency lists of all the other nodes and check whether or not they contain a link to that particular node.

This naive method would be fine for our case, but we can do it more efficiently.
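For reference, the naive check could look like the sketch below. The function name `has_incoming_link` is hypothetical, and the adjacency data is the hand-written graph for the example messages.

```python
def has_incoming_link(letter, adj_list):
    """Naive check: scan every adjacency list for a link into `letter`."""
    return any(letter in targets for targets in adj_list.values())


adj_list = {"z": {"q", "x"}, "m": {"x"}, "x": {"a", "s", "v"},
            "v": {"a"}, "s": {"r"}, "o": {"w"}}
print(has_incoming_link("x", adj_list))  # True: both z and m point to x
print(has_incoming_link("m", adj_list))  # False: nothing points to m
```

Each call scans the whole graph, and the check has to be repeated as letters are removed, which is what the alternatives that follow avoid.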

An alternative is to keep two adjacency lists:

One with the same contents as the one above
One reversed that shows the incoming links

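This way, every time we traverse an edge, we can remove the corresponding edge from the reversed adjacency list. A node whose reversed list becomes empty has no incoming links left.

We can simplify this further. We never need to know which incoming links a node has, only how many remain. So, instead of keeping a reverse adjacency list, we can keep an indegree count for each node and decrement it each time we traverse an edge into that node. When the indegree of a node reaches 0, the node has no incoming links left.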
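A minimal sketch of the indegree bookkeeping, assuming the hand-written adjacency data for the example messages and a `Counter` initialized to zero for every letter (the same idea the full solution uses for `counts`):

```python
from collections import Counter

adj_list = {"z": {"q", "x"}, "m": {"x"}, "x": {"a", "s", "v"},
            "v": {"a"}, "s": {"r"}, "o": {"w"}}
letters = "mzosrqvxauw"  # every unique letter in the example messages

# Start every letter at 0 so that letters without edges are included.
indegree = Counter({c: 0 for c in letters})
for targets in adj_list.values():
    for d in targets:
        indegree[d] += 1

print(sorted(c for c in indegree if indegree[c] == 0))  # ['m', 'o', 'u', 'z']
print(indegree["x"])  # 2, because both m and z must come before x
```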

We perform a BFS over the letters that are reachable, i.e., the letters whose indegree count is zero. A letter only becomes reachable once all the letters that must come before it have been added to the output, result.

We use a queue to keep track of reachable nodes and perform BFS on them. Initially, we enqueue the letters that have a zero indegree count. We keep adding letters to the queue as their indegree counts become zero.

We continue this until the queue is empty. Next, we check whether all the letters in the messages have been added to the output. If some letters are missing, it is because they still have incoming edges left, which means there is a cycle. In this case, we return "".
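To see why a cycle leaves letters out of the output, consider the sketch below. The function name `kahn_order` and the two tiny graphs are assumptions made for illustration; the logic is the same queue-based process described above.

```python
from collections import deque


def kahn_order(adj_list, letters):
    """Return one valid ordering of `letters`, or "" if there is a cycle."""
    indegree = {c: 0 for c in letters}
    for targets in adj_list.values():
        for d in targets:
            indegree[d] += 1

    queue = deque(c for c in letters if indegree[c] == 0)
    result = []
    while queue:
        c = queue.popleft()
        result.append(c)
        for d in adj_list.get(c, ()):
            indegree[d] -= 1
            if indegree[d] == 0:
                queue.append(d)

    # Letters on a cycle never reach indegree 0, so they are never queued.
    return "".join(result) if len(result) == len(letters) else ""


print(repr(kahn_order({"a": {"b"}, "b": {"c"}}, "abc")))  # 'abc'
print(repr(kahn_order({"a": {"b"}, "b": {"a"}}, "abc")))  # '' (a and b form a cycle)
```

In the second call, only c is ever added to the result, so the length check detects the cycle between a and b.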

Remember that several letters can have no incoming edges at the same time. These letters can be processed in any order, which can result in different orderings for the same set of messages, and that’s alright.

The complete implementation of this approach is given below:

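from collections import defaultdict, Counter, deque


def find_dictionary(messages):
    # Step 0: Create data structures and find all unique letters.
    # adj_list maps each letter to the set of letters that must come after it.
    adj_list = defaultdict(set)
    # counts holds the indegree of every letter seen in the messages.
    counts = Counter({c: 0 for message in messages for c in message})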
    
    # Step 1: We need to populate adj_list and counts.
    # For each pair of adjacent messages...
    for message1, message2 in zip(messages, messages[1:]):
        for c, d in zip(message1, message2):
            if c != d:
                if d not in adj_list[c]:
                    adj_list[c].add(d)
                    counts[d] += 1
                break
        else:
            # No differing letter was found, so one message is a prefix of
            # the other. The order is invalid if the longer one comes first.
            if len(message2) < len(message1):
                return ""
    
    # Step 2: We need to repeatedly pick off nodes with an indegree of 0.
    result = []
    queue = deque([c for c in counts if counts[c] == 0])
    while queue:
        c = queue.popleft()
        result.append(c)
        for d in adj_list[c]:
            counts[d] -= 1
            if counts[d] == 0:
                queue.append(d)
                
    # If not all letters are in result, that means there was a cycle and so
    # no valid ordering. Return "".
    if len(result) < len(counts):
        return ""
    # Otherwise, convert the ordering we found into a string and return it.
    return "".join(result)


# Example - 1
messages = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu", "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
print("Dictionary =", find_dictionary(messages))

# Example - 2
messages = ["vanilla", "alpine", "algor", "port", "norm", "nylon", "ophellia", "hidden"]
print("Dictionary =", find_dictionary(messages))

Find dictionary
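Because more than one ordering can be valid, comparing the returned string against a single expected answer is fragile. One option is to check that the messages really are sorted under a candidate dictionary. The sketch below assumes a hypothetical helper named `is_sorted_under` and assumes that the candidate contains every letter used in the messages.

```python
def is_sorted_under(order, messages):
    """Return True if `messages` are in lexicographic order under `order`.

    Assumes `order` contains every letter that appears in `messages`.
    """
    rank = {letter: i for i, letter in enumerate(order)}
    # Tuples compare element by element, and a prefix sorts first,
    # which matches lexicographic order.
    keys = [tuple(rank[c] for c in message) for message in messages]
    return all(a <= b for a, b in zip(keys, keys[1:]))


messages = ["mzosr", "mqov", "xxsvq", "xazv", "xazau", "xaqu", "suvzu",
            "suvxq", "suam", "suax", "rom", "rwx", "rwv"]
print(is_sorted_under("omuzxqwvsar", messages))                  # True
print(is_sorted_under("abcdefghijklmnopqrstuvwxyz", messages))   # False
```

The first candidate is the ordering derived step by step earlier in this lesson. Any non-empty string returned for these messages should pass the same check.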

Complexity measures

Time Complexity	Space Complexity
$O(c)$	$O(1)$

Let $n$ be the total number of messages in the input list.

Let $c$ be the total length of all the messages in the input list, added together.

Let $u$ be the total number of unique letters in the messages. While this is limited to $26$ in our case, we’ll still look at how it would impact the complexity if this was not the case.

Time complexity

There are three parts to the algorithm:

identifying all the relations
putting them into an adjacency list
converting it into a valid alphabet ordering

In the worst case, the identification and initialization parts require checking every letter of every message, which is $O(c)$ ...