Indexing in a Distributed Search

Learn about indexing and its use in a distributed search.

We'll cover the following:

- Indexing
  - Build a searchable index
    - Inverted index
    - Searching from an inverted index
  - Factors of index design
- Indexing on a centralized system

We’ll first describe what indexing is, and then we’ll make our way toward distributing indexes over many nodes.

Indexing

Indexing is the organization and manipulation of data that’s done to facilitate fast and accurate information retrieval.

Build a searchable index

The simplest way to build a searchable index is to assign a unique ID to each document and store it in a database table, as shown in the table below. The first column holds the document ID, and the second column holds the text of that document.

The size of this table varies with the number of documents we have and the size of each one. The table here is only an example, with one or two sentences per document. In a real-world collection, each document could be pages long, which would make the table quite large.

Running a search query on this document-level index isn't fast. On each search request, we have to traverse all the documents and count the occurrences of the search string in each one.
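The full scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the function names `build_document_index` and `search` are hypothetical, an in-memory dictionary stands in for the database table, and matching is a plain case-insensitive substring count.

```python
from itertools import count


def build_document_index(texts):
    """Assign a unique, increasing ID to each document (hypothetical helper)."""
    ids = count(1)
    return {next(ids): text for text in texts}


def search(index, query):
    """Traverse every document and count occurrences of the query string."""
    needle = query.lower()
    hits = {}
    for doc_id, text in index.items():  # cost grows with the total size of all documents
        occurrences = text.lower().count(needle)
        if occurrences:
            hits[doc_id] = occurrences
    return hits


# The three sample documents from the example table.
index = build_document_index([
    "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    "Elasticsearch is a Lucene library-based search engine.",
    "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
])

print(search(index, "search engine"))  # {2: 1}
print(search(index, "lucene"))         # {2: 1, 3: 1}
```

Every query touches every document, so the work per request grows linearly with the total amount of stored text. That is the cost a better data organization needs to avoid.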

Note: A fuzzy search uses approximate string matching rather than exact matching to match results against the search term. For a fuzzy search, we also have to perform different pattern-matching queries, because many strings in the documents would match the search string to some degree. First, we must find the unique candidate strings by traversing all of the documents. Then, we must single out the candidate that most closely matches the search string. Finally, we have to count the occurrences of that best-matching string in each document. This means that each search query takes a long time.
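The three fuzzy-search steps in the note can be sketched as follows. The sketch makes several assumptions that the lesson does not specify: candidates are single lowercase words, similarity is measured with the standard library's `difflib.get_close_matches` (one possible measure among many), and the `fuzzy_search` name and the 0.6 cutoff are illustrative choices.

```python
import difflib
import re


def fuzzy_search(documents, term, cutoff=0.6):
    """Return (best_matching_word, {doc_id: occurrences}) for a misspelled term."""
    # Step 1: traverse all documents and collect the unique candidate words.
    tokenized = {
        doc_id: re.findall(r"[a-z0-9]+", text.lower())
        for doc_id, text in documents.items()
    }
    candidates = sorted({word for words in tokenized.values() for word in words})

    # Step 2: single out the candidate closest to the search term.
    best = difflib.get_close_matches(term.lower(), candidates, n=1, cutoff=cutoff)
    if not best:
        return None, {}
    match = best[0]

    # Step 3: count occurrences of the best match in each document.
    counts = {
        doc_id: words.count(match)
        for doc_id, words in tokenized.items()
        if match in words
    }
    return match, counts


documents = {
    1: "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    2: "Elasticsearch is a Lucene library-based search engine.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}

print(fuzzy_search(documents, "lucen"))  # ('lucene', {2: 1, 3: 1})
```

Steps 1 and 3 each pass over every document, which is why a fuzzy query on a document-level index costs even more than an exact one.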

The response time to a search query depends on a few factors:

- The data organization strategy in the database.
- The size of the data.
- The processing speed and RAM of the machine that's used to build the index and process the search query.

Running search queries on billions of documents that are indexed at the document level will be a slow process, which may take many minutes or even hours. Let's look at another data organization and processing technique that helps reduce the search time.
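A rough back-of-the-envelope estimate shows where "minutes or even hours" can come from. All numbers below are illustrative assumptions rather than measurements: one billion documents, an average of 10 KB of text per document, and a single machine that scans text at 1 GB per second. The helper name `full_scan_seconds` is hypothetical.

```python
def full_scan_seconds(num_docs, avg_doc_bytes, scan_bytes_per_sec):
    """Estimate the time for one query that must read every document once."""
    return num_docs * avg_doc_bytes / scan_bytes_per_sec


# Assumed workload: 1 billion documents, 10 KB each, scanned at 1 GB/s.
seconds = full_scan_seconds(
    num_docs=1_000_000_000,
    avg_doc_bytes=10_000,
    scan_bytes_per_sec=1_000_000_000,
)
print(f"{seconds:.0f} s, about {seconds / 3600:.1f} hours")  # 10000 s, about 2.8 hours
```

Under these assumptions, a single query reads about 10 TB and takes close to three hours. Changing any of the three factors listed above (data organization, data size, or machine capability) changes the result proportionally.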

ID	Document Content
1	Elasticsearch is the distributed and analytics engine that is based on REST APIs.
2	Elasticsearch is a Lucene library-based search engine.
3	Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
