Advanced Data Structures: Implementing Tries in C++ and Java

Basics of Tries

As we've seen before, a trie is an ordered, tree-based data structure used to store associative data, most commonly keyed by strings.

There are many definitions available on the internet for tries, but here we'll try to learn the intuition behind the data structure, its usage, and its practical applications.

Intuition

In this lesson, we'll learn about the intuition behind the idea of tries and how other data structures could be used to solve similar problems. If you know the basics of a trie, you can jump to the next chapter.

Understanding prefixes

A prefix is a substring that occurs at the beginning of a string. A substring is a contiguous sequence of characters within a string. For example, the word "brick" has the following prefixes: "b", "br", "bri", "bric", and "brick".
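The definition above can be illustrated with a small sketch. The function name `prefixesOf` is hypothetical and chosen only for this example; it lists every non-empty prefix of a word, shortest first.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns every non-empty prefix of `word`, shortest first.
// `prefixesOf` is an illustrative name, not part of any library.
std::vector<std::string> prefixesOf(const std::string& word) {
    std::vector<std::string> result;
    for (std::size_t len = 1; len <= word.size(); ++len) {
        result.push_back(word.substr(0, len));  // the first `len` characters
    }
    return result;
}
```

Calling `prefixesOf("brick")` yields the five prefixes listed above, from "b" up to "brick".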

Problem

Let's try to build a basic search feature. When a user types in a word, the software must show the user all the dictionary words that begin with the typed text. The back-end server maintains a dictionary of valid words. The typed text is sent to the back-end server as a prefix string, and the server returns the list of all the words that have it as a prefix.

Note: If a user types in "bri", the server returns a list of words beginning with "bri" as suggestions, such as "brick", "bright", "brisket", "bride", and more.

To analyze the performance of our solutions, let's define a few variables.

Length of the word typed by the user for searching = $W$.
Total number of words present in the dictionary = $N$.

In the section below, we've explained the possible ways to build this feature.

Possible solutions

Below we describe some possible solutions using well-known data structures and algorithms.

List-based solution

The simplest solution is to store all the words in a list. Then, whenever the server receives a prefix string from the user, it searches the entire list and returns all the words with the matching prefix. But this implementation is computationally heavy and inefficient, because every query scans the whole dictionary.
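As a minimal sketch of the list-based idea, assuming the dictionary is held in a `std::vector<std::string>`, the scan could look like this. The names `startsWith` and `searchList` are hypothetical helpers introduced only for illustration.

```cpp
#include <string>
#include <vector>

// True when `word` begins with `prefix`; compares at most prefix.size() characters.
bool startsWith(const std::string& word, const std::string& prefix) {
    return word.size() >= prefix.size() &&
           word.compare(0, prefix.size(), prefix) == 0;
}

// Scans the whole dictionary and keeps the words that start with `prefix`.
// Matches are returned in the order in which they appear in the list.
std::vector<std::string> searchList(const std::vector<std::string>& dictionary,
                                    const std::string& prefix) {
    std::vector<std::string> matches;
    for (const std::string& word : dictionary) {
        if (startsWith(word, prefix)) {
            matches.push_back(word);
        }
    }
    return matches;
}
```

Every call visits all the words in the list, even when only a handful of them match.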

Time complexity

We iterate through a list of length $N$ and perform prefix matching for each dictionary word. Prefix matching takes time equivalent to the length of the prefix string, which is $W$. Therefore, the total time complexity is $O(NW)$.

Space complexity

Since we're storing $N$ words in memory, each of length at most $W$ (here we let $W$ bound the length of the dictionary words as well), the space complexity becomes $O(NW)$.

Sorted list-based solution

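A faster approach is to keep the list of words sorted in lexicographic order. All the words that share a prefix then sit next to each other, so the server can find them with binary search instead of scanning the entire list. For example, if the prefix entered by the user is "bri", the matching words form one contiguous run in the sorted list stored in the server's memory.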

Time complexity

The time complexity for sorting the list of strings using a standard sorting algorithm like merge sort is $O(N \log N)$, counting each string comparison as a single step. We iterate through the query string character by character and perform a binary search for each added character. Since there are $W$ letters in the query string, the time complexity for searching is $O(W \log N)$. Assuming $N \gg W$, the overall time complexity is dominated by the one-time sort and becomes $O(N \log N)$.
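The sketch below shows one way to search a sorted list. It is a simplification of the per-character search described above: it runs a single binary search for the whole prefix with `std::lower_bound` and then walks forward over the neighbouring matches. The function name `searchSorted` is hypothetical, and the input is assumed to be sorted already.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Assumes `sortedWords` is already in lexicographic order (for example via std::sort).
std::vector<std::string> searchSorted(const std::vector<std::string>& sortedWords,
                                      const std::string& prefix) {
    std::vector<std::string> matches;
    // Binary search for the first word that is not less than the prefix.
    auto it = std::lower_bound(sortedWords.begin(), sortedWords.end(), prefix);
    // Words sharing the prefix are adjacent, so walk until the first mismatch.
    while (it != sortedWords.end() &&
           it->compare(0, prefix.size(), prefix) == 0) {
        matches.push_back(*it);
        ++it;
    }
    return matches;
}
```

The binary search takes a logarithmic number of comparisons, and the walk that follows only touches the words that are actually returned.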

Space complexity

A sorting algorithm like merge sort incurs an auxiliary space cost of $O(N)$. Also, since we're storing $N$ words of size at most $W$ in memory, the words themselves take $O(NW)$ space. Hence, the total space complexity becomes $O(N + NW) = O(NW)$.

Hashmap containing all prefixes

Another possible solution is to generate all the possible prefixes of all the words and store them in a hashmap. Let's say we have the words "brick", "bright", and "bride". We create a hashmap of all the prefixes generated from these words. Each prefix string $P$ is a key of the hashmap, and the list of words that have $P$ as a prefix is its value. The hashmap would look something like this:

"b", "br", "bri": ["brick", "bright", "bride"]
"bric", "brick": ["brick"]
"brig", "brigh", "bright": ["bright"]
"brid", "bride": ["bride"]

Now, whenever a user types any string, we'll search for that string in our hashmap and return all the suitable matches for the prefix.
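A minimal sketch of this idea, assuming `std::unordered_map` as the hashmap, might look as follows. The alias `PrefixMap` and the functions `buildPrefixMap` and `searchMap` are hypothetical names used only for this example.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

using PrefixMap = std::unordered_map<std::string, std::vector<std::string>>;

// Registers every non-empty prefix of every word as a key that maps to the full words.
PrefixMap buildPrefixMap(const std::vector<std::string>& words) {
    PrefixMap map;
    for (const std::string& word : words) {
        for (std::size_t len = 1; len <= word.size(); ++len) {
            map[word.substr(0, len)].push_back(word);
        }
    }
    return map;
}

// One hash lookup; returns an empty list when the prefix was never stored.
std::vector<std::string> searchMap(const PrefixMap& map, const std::string& prefix) {
    auto it = map.find(prefix);
    if (it == map.end()) {
        return {};
    }
    return it->second;
}
```

For the words "brick", "bright", and "bride", the map ends up with ten keys, and looking up "bri" returns all three words.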

Time complexity

The average time complexity of inserting a key-value pair into a hashmap is $O(1)$. The time required to generate the hash of a string is proportional to its length. So, the time complexity for searching a key in the hashmap is proportional to the length of the query string, which is $O(W)$.

Space complexity

For every possible prefix, a list of words is stored as the value. Although this seems like a fast solution thanks to the hashmap, it's very expensive in terms of memory. Each of the $N$ words contributes up to $W$ prefix keys of up to $W$ characters each, so the keys alone can take $O(NW^2)$ space in the worst case, and every word is repeated in the value list of each of its prefixes. Adding a new word creates key-value pairs for all of its new prefixes and also requires updating the value lists of the prefixes that already exist. This adds significant memory and time overhead.

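Tree-based solution

The last approach stores the dictionary in a tree in which every node represents one character, so each path from the root spells out a prefix. When we get a prefix to search, we traverse down the tree, following the string character by character. Once all the characters in the prefix string are exhausted, all the words stored in the subtree below the current node are returned as the answer.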

This approach is very efficient when you consider that search queries are typed incrementally. Suppose a user types the search query "bri" and the server returns the list of words with "bri" as a prefix. If the user keeps typing and the query becomes "bric", we don't need to start again from the tree's root. Instead, we can continue traversing the tree from the last node we reached.
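The traversal described above can be sketched as follows. This is one possible layout under stated assumptions, not a definitive implementation: children are kept in a `std::map` so that results come out in alphabetical order, and the names `TrieNode`, `insertWord`, `descend`, `collect`, and `wordsUnder` are hypothetical.

```cpp
#include <map>
#include <memory>
#include <string>
#include <vector>

// One node per character; `isWord` marks the end of a dictionary word.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> children;
    bool isWord = false;
};

// Adds `word` below `root`, creating only the nodes that do not exist yet.
void insertWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char c : word) {
        std::unique_ptr<TrieNode>& child = node->children[c];
        if (!child) {
            child = std::make_unique<TrieNode>();
        }
        node = child.get();
    }
    node->isWord = true;
}

// Follows `text` character by character from `start`; nullptr if the path breaks.
const TrieNode* descend(const TrieNode* start, const std::string& text) {
    const TrieNode* node = start;
    for (char c : text) {
        auto it = node->children.find(c);
        if (it == node->children.end()) {
            return nullptr;
        }
        node = it->second.get();
    }
    return node;
}

// Depth-first walk that appends every word found under `node` to `out`.
// `path` holds the characters spelled from the root down to `node`.
void collect(const TrieNode* node, std::string& path, std::vector<std::string>& out) {
    if (node == nullptr) return;
    if (node->isWord) out.push_back(path);
    for (const auto& entry : node->children) {
        path.push_back(entry.first);
        collect(entry.second.get(), path, out);
        path.pop_back();
    }
}

// Returns all words under `node`, where `prefix` is the text that leads to `node`.
std::vector<std::string> wordsUnder(const TrieNode* node, std::string prefix) {
    std::vector<std::string> out;
    collect(node, prefix, out);
    return out;
}
```

Because `descend` accepts any starting node, a caller that already holds the node for "bri" can reach the node for "bric" with `descend(briNode, "c")` instead of walking from the root again.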

Other benefits of this approach include the capability to maintain details like the most searched prefix path and other additional information. These parameters are generally used to optimize the search suggestions.

Time complexity

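The time complexity for the creation of a trie is $O(NW)$, since all the $N$ words of size at most $W$ are inserted character by character as nodes in the tree. Searching for a prefix is $O(W)$, since we only need to traverse as many nodes as the prefix has characters, which is at most $W$. Collecting the suggestions then takes additional time proportional to the size of the subtree below that node.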

Space complexity

We create new nodes for every character in every word. In the worst case, when no two words share a prefix, $N \times W$ new nodes are created. Hence, the space complexity becomes $O(NW)$.
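The $N \times W$ figure is an upper bound. A character trie has one non-root node per distinct non-empty prefix of its words, so words that share prefixes also share nodes. The sketch below counts those distinct prefixes without building the tree; the function name `trieNodeCount` is hypothetical.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Counts the non-root nodes a character trie would need for `words`:
// one node per distinct non-empty prefix.
std::size_t trieNodeCount(const std::vector<std::string>& words) {
    std::set<std::string> distinctPrefixes;
    for (const std::string& word : words) {
        for (std::size_t len = 1; len <= word.size(); ++len) {
            distinctPrefixes.insert(word.substr(0, len));
        }
    }
    return distinctPrefixes.size();
}
```

For "brick", "bright", and "bride", the count is 10 nodes, even though the three words contain 16 characters in total, because the path "b", "br", "bri" is stored only once.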
