What is a search system?

Behind every search bar, there is a search system. A search system is a system that takes some text input, a search query, from the user and returns the relevant content in a few seconds or less. There are three main components of a search system, namely:

A crawler, which fetches content and creates documentsFor a search engine, a document consists of the text extracted from a web page. In a movie store’s web page, a document could be a JSON object containing titles, descriptions, and other metadata of the videos upon which we want to perform search queries. The documents could be JSON or any other suitable format. Documents are stored on a distributed storage like S3 or HDFS..
An indexer, which builds a searchable index.
A searcher, which responds to search queries by running the search query on the index created by the indexer.

Note: We have a separate lesson dedicated to the explanation of the crawler component. In this lesson, we’ll focus on indexing.

Requirements

Let’s go over the functional and non-functional requirements of a distributed search system.

Functional requirements

The following is a functional requirement of a distributed search system:

Search: Users should get relevant content based on their search queries.

Non-functional requirements

Here are the non-functional requirements of a distributed search system:

Availability: The system should be highly available to the users.
Scalability: The system should have the ability to scale with the increasing amount of data. In other words, it should be able to index a large amount of data.
Fast search on big data: The user should get the results quickly, no matter how much content they are searching.
Reduced cost: The overall cost of building a search system should be low.

Indexing

Indexing is the organization and manipulation of data that’s done to facilitate fast and accurate information retrieval. However, running search queries on billions of documents that are document-level indexed will be a slow process, which may take many minutes, or even hours. Therefore, we will employ an inverted index as explained below.

Inverted index

An inverted index is a HashMap-like data structure that employs a document-term matrix. Instead of storing the complete document as it is, it splits the documents into individual words. After this, the document-term matrixA document-term matrix is a mathematical matrix that represents the frequency of terms in a list of documents. identifies unique words and discards frequently occurring words like “to,” “they,” “the,” “is,” and so on. Frequently occurring words like those are called terms. The document-term matrix maintains a term-level index through this identification of unique words and deletion of unnecessary terms.

For each term, the index computes the following information:

The list of documents in which the term appeared.
The frequency with which the term appears in each document.
The position of the term in each document.

In the table above, the “Term” column contains all the unique terms that are extracted from all of the documents. Each entry in the “Mapping” column consists of three lists:

A list of documents in which the term appeared.
A list that counts the frequency with which the term appears in each document.
A two-dimensional list that pinpoints the position of the term in each document. A term can appear multiple times in a single document, which is why a two-dimensional list is used.

Note: Instead of lists, the mappings could also be in the form of tuples, such as doc, freq, and loc.

The inverted index is one of the most popular index mechanisms used in document retrieval. It enables efficient ...

Term	Mapping ( [doc], [freq], [[loc])
elasticsearch	( [1, 2, 3], [1, 1, 1], [[1], [1], [1]] )
distributed	( [1, 3], [1, 1], [[4], [4]] )
restful	( [1], [1], [[5]] )
stack	( [1], [1], [[16]] )

Introduction

Elementary Design Problems

Advanced Design Problems

Concluding Remarks

Design a Distributed Search