Design a Distributed Search
Learn to design a distributed search system.
What is a search system?
Behind every search bar, there is a search system. A search system is a system that takes some text input, a search query, from the user and returns the relevant content in a few seconds or less. There are three main components of a search system, namely:
A crawler, which fetches content and creates
.documents For a search engine, a document consists of the text extracted from a web page. In a movie store’s web page, a document could be a JSON object containing titles, descriptions, and other metadata of the videos upon which we want to perform search queries. The documents could be JSON or any other suitable format. Documents are stored on a distributed storage like S3 or HDFS. An indexer, which builds a searchable index.
A searcher, which responds to search queries by running the search query on the index created by the indexer.
Note: We have a separate lesson dedicated to the explanation of the crawler component. In this lesson, we’ll focus on indexing.
Requirements
Let’s go over the functional and non-functional requirements of a distributed search system.
Functional requirements
The following is a functional requirement of a distributed search system:
- Search: Users should get relevant content based on their search queries.
Non-functional requirements
Here are the non-functional requirements of a distributed search system:
- Availability: The system should be highly available to the users.
- Scalability: The system should have the ability to scale with the increasing amount of data. In other words, it should be able to index a large amount of data.
- Fast search on big data: The user should get the results quickly, no matter how much content they are searching.
- Reduced cost: The overall cost of building a search system should be low.
We need a distributed storage in our design. Therefore, we can use the blob store, a previously discussed building block, to store the data to be indexed and the index itself. We’ll use the generic term “distributed storage” instead of the specific term “blob store.”
Indexing
Indexing is the organization and manipulation of data that’s done to facilitate fast and accurate information retrieval. However, running search queries on billions of documents that are document-level indexed will be a slow process, which may take many minutes, or even hours. Therefore, we will employ an inverted index as explained below.
Inverted index
An inverted index is a HashMap-like data structure that employs a document-term matrix. Instead of storing the complete document as it is, it splits the documents into individual words. After this, the
For each term, the index computes the following information:
- The list of documents in which the term appeared.
- The frequency with which the term appears in each document.
- The position of the term in each document.
Inverted Index
Term | Mapping ( [doc], [freq], [[loc]) |
elasticsearch | ( [1, 2, 3], [1, 1, 1], [[1], [1], [1]] ) |
distributed | ( [1, 3], [1, 1], [[4], [4]] ) |
restful | ( [1], [1], [[5]] ) |
stack | ( [1], [1], [[16]] ) |
In the table above, the “Term” column contains all the unique terms that are extracted from all of the documents. Each entry in the “Mapping” column consists of three lists:
- A list of documents in which the term appeared.
- A list that counts the frequency with which the term appears in each document.
- A two-dimensional list that pinpoints the position of the term in each document. A term can appear multiple times in a single document, which is why a