Basics of Tries
Get an introduction to the basics of tries and the intuition behind the data structure.
We'll cover the following
As we've seen before, a trie is a tree-based, ordered data structure that stores associative data structures, primarily strings.
There are many definitions available on the internet for tries, but here we'll try to learn the intuition behind the data structure, its usage, and its practical applications.
Intuition
In this lesson, we'll learn about the intuition behind the idea of tries and how other data structures could be used to solve similar problems. If you know the basics of a trie, you can jump to the next chapter.
Understanding prefixes
A prefix is a substring that occurs at the beginning of a string. A substring is a contiguous sequence of characters within a string. For example, the word "brick" has the following prefixes: "b", "br", "bri", "bric", and "brick".
Problem
Let's try to build a primary search feature. When a user types in a word, the software must show the user all the words which can be constructed using the current word as a prefix. The back-end server maintains a dictionary of valid words. The prefix string is sent to the back-end server, and the server returns the list of all the words with the current word as a prefix.
Note: If a user types in "bri",the server returns a list of strings containing "bri" as suggestions, such as "brick", "bright", "brisket", "bride", and more.
To analyze the performance of our solutions, let's define a few variables.
Length of the word typed by the user for searching =
. Total number of words present in the dictionary =
.
In the section below, we've explained the possible ways to build this feature.
Possible solutions
Below we describe some possible solutions using well-known data structures and algorithms.
List-based solution
The simplest solution is to store all the words in a list. Then, whenever the server receives a prefix string from the user, it searches the entire list and returns all the words with the matching prefix. But this implementation is computation heavy and inefficient.
for word in words_list{// isPrefix(p, s) is function that// returns true ID p is a prefix of s, else falseif (isPrefix(pre, word))suggestions.add(word)}return suggestions
Time complexity
We iterate through a list of length
Space complexity
Since we're storing
Sorted list-based solution
Let's now consider an example where the prefix entered by the user is "bri", and the list of words stored in the server's memory is:
["apple", "ant", "bin", "back", "ball", "big", "brick", "bribe", "broke", "bride", "brag", "bridge", "bright", "brisket", "yellow", "wood"]
In this case, it's certain that the words not starting with the letter "b" are invalid suggestions. This means sorting the list of words in alphabetical order can ease the searching procedure.
After sorting the list, we can search for all the words starting with "b" using binary search (applying lower- and upper-bound functions). Now our search space is reduced to the list given below:
["bin", "back", "ball", "big", "brick", "bribe", "broke", "bride", "brag", "bridge", "bright", "brisket"]
To search for words with the prefix "br", we can apply binary search again to get the upper and lower bounds of all words with the second character being "r". The final output would be as given below:
["brag", "bribe", "brick", "bride", "bridge", "bright", "brisket", "broke"]
And finally, with another search of the third character as "i", the output would look like the below list:
["bribe", "brick", "bride", "bridge", "bright", "brisket"]
The pseudocode below summarizes the steps involved in this approach:
//sort the listsorted_list = sort(words_list)for (i in range 0 to len(prefix)){prefix_to_search = substr(0, i)//binary search to find start and end indicesstart_index = lower_bound(sorted_list, prefix_to_search)end_index = upper_bound(sorted_list, prefix_to_search)// keep reducing the search spacesorted_list = sorted_list[start_index : end_index]}return sorted_list
Time complexity
The time complexity for sorting the list of strings using a standard sorting algorithm like merge sort is
Space complexity
A sorting algorithm like merge sort would incur a space complexity of
Hashmap containing all prefixes
Another possible solution to the problem is by generating all the possible prefixes of all the words and storing them in a hashmap. Let’s say we have the words "brick", "bright", and "bride". We create a hashmap of all the prefixes generated from these words. The string P
will be the key of the hashmap, and the list of words containing P
as a prefix will be the value. The hashmap would look something like this:
{
"b"
: ["brick", "bright", "bride"]
"br"
: ["brick", "bright", "bride"]
"bri"
: ["brick", "bright", "bride"]
"bric"
: ["brick"]
"brick"
: ["brick"]
"brid"
: ["bride"]
"bride"
: ["bride"]
"brig"
: ["bright"]
"brigh"
: ["bright"]
"bright"
: ["bright"]
}
Now, whenever a user types any string, we'll search for that string in our hashmap and return all the suitable matches for the prefix.
Time complexity
The average time complexity of the insertion of key-value pairs in a hashmap is
Space complexity
For every possible prefix, multiple words are present as a value in the form of a list. Although this seems like a fast solution due to the involvement of hashmap, it's a nightmare in terms of memory utilization. New key-value pairs are created for all possible prefixes on adding new words, and already existing value lists must also be checked and modified. This adds a lot of memory requirements and time complexity.
Tree-based solution
A somewhat different approach is to store the words in a tree or a linked-list fashion, where a node represents a single character of the string, and the next node is the next character in the string. They share the same root and hence form a tree structure. For example, for the word list ["brick", "bribe", "bride", "bridge", "bright", "brisket"]
, the tree structure would look something like the below:
When we get a prefix to search, we traverse down the tree, following the string character by character. Once all the characters in the prefix string are exhausted, all nodes below the current node are returned as an answer.
This approach is very efficient if you consider that search suggestions are continuous. Suppose a user types a search query "bri" and the server returns the list of words containing "bri" as a prefix. If the user keeps typing further to search for the string "bric" then we don't need to start again from the tree's root. Instead, we can keep traversing the tree from the last traversed point.
Other benefits of this approach include the capability to maintain details like the most searched prefix path and other additional information. These parameters are generally used to optimize the search suggestions.
Time complexity
The time complexity for the creation of a trie is
Space complexity
We create new nodes for every character in every word. In the worst case,
Revisiting the definition of a trie
Let's revisit the definition of a trie and check if it makes more sense to us than before. The idea of the trie is very similar to the tree-based solution above. If we put all the string characters in a tree, with each node representing one character, the tree is called a trie.
There are multiple applications of tries apart from prefix searching. Also, tries can be implemented differently depending on the use case.