Feature #4: Compress File
Implementing the "Compress File" feature for our "Operating System" project.
Description
We have to come up with a compression strategy for text files that store our system’s information. Here is our strategy: whenever we see a word in a file that is composed as a concatenation of other, smaller words in the same file, we will encode it with the smaller words’ IDs. For example, suppose a text file contains the words n, cat, cats, dog, and catsndog. The word catsndog is the concatenation of the words cats, n, and dog, so we can assign it an ID. This way, instead of storing the entire catsndog word as is, we store an ID that takes much less space.
We’ll be provided a list of strings representing all the words from a text file. Our task will be to identify and isolate all the concatenated words.
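To make the compression idea concrete, here is a small illustration. The ID assignment scheme below is hypothetical (the original does not specify one); it just shows how a concatenated word could be stored as a list of its parts’ IDs:

```python
# Illustration only: hypothetical ID assignment for the compression idea.
# Base words get integer IDs; a concatenated word is stored as an ID list.
base_words = ["n", "cats", "dog"]
ids = {w: i for i, w in enumerate(base_words)}  # {'n': 0, 'cats': 1, 'dog': 2}

# "catsndog" = "cats" + "n" + "dog", so store it as the parts' IDs.
encoded = [ids["cats"], ids["n"], ids["dog"]]
print(encoded)  # [1, 0, 2]
```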
Solution
We’ll traverse the list of strings, and for each string, we’ll check every single combination. For example, some combinations of the word catsndog are (c, atsndog), (ca, tsndog), (cat, sndog), (cats, ndog), etc. Each combination gives us two words: we call the first word the prefix and the second word the suffix. Now, for a combination to be considered a concatenated word, both the prefix and the suffix should be present in our list of strings. We’ll check the prefix first, because if the prefix is not present, there is no need to check the suffix, and we can move to the next combination.
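Generating these combinations is just enumerating every split point of the word, which can be sketched as:

```python
word = "catsndog"
# One (prefix, suffix) combination per split point of the word
combinations = [(word[:i], word[i:]) for i in range(1, len(word))]
print(combinations[:4])
# [('c', 'atsndog'), ('ca', 'tsndog'), ('cat', 'sndog'), ('cats', 'ndog')]
```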
If the prefix word is present in our list of strings, we move on to check the suffix word. If the suffix word is also present, we have found a concatenated word. Otherwise, there are two possibilities:
- The suffix word is itself a concatenation of two or more words. We recursively call the same function on the suffix word to generate more (prefix, suffix) pairs until a suffix that is present in our list of strings is found.
- There is no such suffix word present in our list of strings, and our current combination is not a concatenated word.
When we break down the first suffix word, the recursion keeps splitting it, down to the last letter if necessary, checking each piece against our list of strings. After this, we move to the next combination. This breakdown-and-search process occurs in DFS fashion, which forms a tree-like structure of splits for each word.
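The DFS described above can be sketched as follows. This is a minimal version without caching; the function names are my own, not from the original:

```python
def is_concatenated(word, word_set):
    # DFS: try every (prefix, suffix) combination of `word`.
    # A word is concatenated if some prefix is in the set and the
    # suffix is either in the set or itself concatenated.
    for i in range(1, len(word)):
        prefix, suffix = word[:i], word[i:]
        if prefix in word_set:
            if suffix in word_set or is_concatenated(suffix, word_set):
                return True
    return False

def find_concatenations(words):
    word_set = set(words)
    return [w for w in words if w and is_concatenated(w, word_set)]

print(find_concatenations(["n", "cat", "cats", "dog", "catsndog"]))
# ['catsndog']
```

Note that the loop only tries proper prefixes, so a word is never counted as a "concatenation" of just itself.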
Now, for a single string, we generate multiple combinations, and for each of those combinations, we might recursively generate further splits. There will likely be overlap in which we compute the validity of the same word multiple times, and this increases our time complexity. For example, for the combination (c, atsndog), at some point we will have to check the suffix combination (a, tsndog). Later, when we get the combination (ca, tsndog) from a different iteration, we will check the word tsndog again even though it has already been checked. The easiest way to counter this extra computation time is to use a cache. This way, if we encounter the same word again, instead of calling an expensive DFS function, we can simply return its value from the cache. This technique is also known as memoization.
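A memoized variant of the DFS might look like the sketch below. The cache maps each already-examined word to whether it can be segmented into dictionary words, so repeated suffixes like tsndog are only evaluated once (names are my own):

```python
def find_concatenations_memo(words):
    word_set = set(words)
    cache = {}  # word -> bool: can it be segmented into dictionary words?

    def can_segment(word):
        # True if `word` splits into one or more words from word_set.
        if word in cache:
            return cache[word]           # reuse previously computed result
        result = False
        for i in range(1, len(word) + 1):
            if word[:i] in word_set:
                suffix = word[i:]
                if not suffix or can_segment(suffix):
                    result = True
                    break
        cache[word] = result
        return result

    out = []
    for w in words:
        # Require at least two parts: only proper prefixes of w are tried.
        if any(w[:i] in word_set and can_segment(w[i:])
               for i in range(1, len(w))):
            out.append(w)
    return out

print(find_concatenations_memo(["n", "cat", "cats", "dog", "catsndog"]))
# ['catsndog']
```

With the cache, each distinct suffix is resolved at most once, instead of once per path in the DFS tree that reaches it.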