Indexing in a Distributed Search

Learn about indexing and its use in a distributed search.

We’ll first describe what indexing is, and then we’ll make our way toward distributing indexes over many nodes.

Indexing

Indexing is the organization and manipulation of data that’s done to facilitate fast and accurate information retrieval.

Build a searchable index

The simplest way to build a searchable index is to assign a unique ID to each document and store the documents in a database table, as shown in the following table. The first column is the document's ID, and the second column contains the text of that document.

Simple Document Index

ID | Document Content
1 | Elasticsearch is the distributed and analytics engine that is based on REST APIs.
2 | Elasticsearch is a Lucene library-based search engine.
3 | Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
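The table above can be approximated in a few lines of Python. This is only a minimal sketch: the `DocumentTable` class and its `add` method are hypothetical names chosen for illustration, and an in-memory dictionary stands in for the database table.

```python
from itertools import count


class DocumentTable:
    """Hypothetical in-memory stand-in for the database table (ID -> text)."""

    def __init__(self):
        self._next_id = count(1)  # IDs start at 1, as in the table above
        self.rows = {}

    def add(self, text):
        """Store a document under a fresh unique ID and return that ID."""
        doc_id = next(self._next_id)
        self.rows[doc_id] = text
        return doc_id


table = DocumentTable()
for text in [
    "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    "Elasticsearch is a Lucene library-based search engine.",
    "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
]:
    table.add(text)

for doc_id, text in table.rows.items():
    print(doc_id, text)
```

Each row maps one ID to the full text of one document, so the structure grows with both the number of documents and their length.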

The size of this table depends on the number of documents we have and on the size of each document. The table above is only an example, and the content of each document is just a sentence or two. In a real-world system, the content of every document could be pages long, which would make the table quite large. Running a search query against this document-level index is slow: on each search request, we have to traverse all the documents and count the occurrences of the search string in each one.
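To make the cost concrete, here is a minimal sketch of such a search over the three example documents. The `search` function is a hypothetical helper, and it counts case-insensitive substring matches, which is an assumption about how "occurrence" is defined.

```python
documents = {
    1: "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    2: "Elasticsearch is a Lucene library-based search engine.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}


def search(documents, term):
    """Scan every document and count occurrences of `term` (substring match)."""
    term = term.lower()
    hits = {}
    for doc_id, text in documents.items():  # visits every document on every query
        occurrences = text.lower().count(term)
        if occurrences:
            hits[doc_id] = occurrences
    return hits


print(search(documents, "distributed"))  # {1: 1, 3: 1}
print(search(documents, "lucene"))       # {2: 1, 3: 1}
```

Every call reads the full text of every document, so the work per query grows with the total size of the stored content.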

Note: For a fuzzy search (a search that uses approximate string matching rather than exact matching to match results against the search term), we also have to run pattern-matching queries. Many strings in the documents will partially match the search string. First, we must find the unique candidate strings by traversing all of the documents. Then, we must single out the candidate that most closely matches the search string. Finally, we have to count the occurrences of that best match in each document. As a result, each search query takes a long time.
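The three steps in the note can be sketched as follows. This is an illustration under stated assumptions, not how any particular search engine implements fuzzy matching: candidates are taken to be individual lowercase words, similarity is measured with `difflib.get_close_matches` from the Python standard library, and `tokenize` and `fuzzy_search` are hypothetical helper names.

```python
import difflib
import re

documents = {
    1: "Elasticsearch is the distributed and analytics engine that is based on REST APIs.",
    2: "Elasticsearch is a Lucene library-based search engine.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words (an assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_search(documents, term):
    # Step 1: traverse all documents to collect the unique candidate strings.
    tokens_by_doc = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    vocabulary = sorted({tok for toks in tokens_by_doc.values() for tok in toks})

    # Step 2: single out the candidate that most closely matches the term.
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=0.6)
    if not matches:
        return None, {}
    best = matches[0]

    # Step 3: count the occurrences of the best match in each document.
    counts = {
        doc_id: toks.count(best)
        for doc_id, toks in tokens_by_doc.items()
        if best in toks
    }
    return best, counts


print(fuzzy_search(documents, "distribted"))  # ('distributed', {1: 1, 3: 1})
```

Note that the sketch reads every document in step 1 and again in step 3, which is why fuzzy queries are even more expensive than exact ones on a document-level index.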

The response time to a search query depends on a few factors:

  • The data organization strategy in the database.
  • The size of the data.
  • The processing speed and the RAM of the machine that’s used to build the index and process the search query.

Running search queries on billions of documents that are ...
