Indexing

Understand what indexing is and how it is used in distributed search.

We will first describe what indexing is and then work our way toward distributing indexes across many nodes.

Indexing

Indexing is the organization and manipulation of data to facilitate fast and accurate information retrieval.

Building a searchable index

The simplest way to build a searchable index is to assign a unique ID to each document and store it in a database table, as shown in the following table. The first column holds each document's ID, and the second column holds the document's text.

Simple Document Index

ID | Document content
1  | Elasticsearch is the distributed, RESTful search and analytics engine at the heart of the Elastic Stack.
2  | Elasticsearch is a search engine based on the Lucene library.
3  | Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
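
As a rough illustration of this document-level index, the sketch below stores the three example documents in a table keyed by a unique ID. It assumes an in-memory SQLite database; the table name `documents` and the helper `build_document_index` are hypothetical names chosen for this example, not part of any search engine's API.

```python
import sqlite3

# The three example documents from the table above.
DOCUMENTS = [
    "Elasticsearch is the distributed, RESTful search and analytics engine "
    "at the heart of the Elastic Stack",
    "Elasticsearch is a search engine based on the Lucene library.",
    "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
]


def build_document_index(documents):
    """Store each document in a table and let SQLite assign a unique ID."""
    # Assumption: an in-memory database is enough for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE documents (id INTEGER PRIMARY KEY, content TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO documents (content) VALUES (?)",
        [(text,) for text in documents],
    )
    conn.commit()
    return conn


conn = build_document_index(DOCUMENTS)
rows = conn.execute("SELECT id, content FROM documents ORDER BY id").fetchall()
for doc_id, content in rows:
    print(doc_id, content)
```

Each row maps one ID to one full document, which is why answering a query against this structure requires reading every row.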

The table above can grow very large, depending on how many documents we have, and the documents themselves can be large. The text in the table is only an example with one or two sentences per document; in practice, each document could be pages long. Running a search query against this document-level index is slow: on each search request, we have to traverse all the documents and count the occurrences of the search string in each one.
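
The full scan described above can be sketched as follows. This is a minimal illustration, assuming the index is a plain dictionary from document ID to text; `DOCUMENT_INDEX` and `search` are hypothetical names, and matching is done with simple case-insensitive substring counting.

```python
# Document-level index: document ID -> full document text.
DOCUMENT_INDEX = {
    1: "Elasticsearch is the distributed, RESTful search and analytics engine "
       "at the heart of the Elastic Stack",
    2: "Elasticsearch is a search engine based on the Lucene library.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}


def search(index, term):
    """Scan every document and count occurrences of `term` in each one.

    The cost grows with the total amount of text, because every
    document is read in full on every query.
    """
    term = term.lower()
    hits = {}
    for doc_id, text in index.items():
        # Substring count, so "search" also matches inside "Elasticsearch".
        count = text.lower().count(term)
        if count:
            hits[doc_id] = count
    return hits


print(search(DOCUMENT_INDEX, "lucene"))  # {2: 1, 3: 1}
```

Nothing is precomputed here, so the same work is repeated for every query, no matter how often a term has been searched before.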

For fuzzy search (a type of search that uses approximate string matching rather than exact matching against the search term), we also have to run different pattern-matching queries, because many strings in the documents will partially match the searched string. First, we must find the unique candidate strings by traversing all the documents. Then we have to find the closest match among these candidates. Finally, we have to count the occurrences of that best-matching string in each document. All of this takes a lot of time for each search query.
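
The three fuzzy-search steps above might look like the sketch below. It is only one possible approach: it assumes candidates are individual lowercase words and uses Python's standard `difflib.get_close_matches` as the similarity measure. The names `tokenize` and `fuzzy_search` and the `cutoff` default are assumptions made for this example.

```python
import re
from difflib import get_close_matches

DOCUMENT_INDEX = {
    1: "Elasticsearch is the distributed, RESTful search and analytics engine "
       "at the heart of the Elastic Stack",
    2: "Elasticsearch is a search engine based on the Lucene library.",
    3: "Elasticsearch is a distributed search and analytics engine built on Apache Lucene.",
}


def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def fuzzy_search(index, term, cutoff=0.6):
    """Return the closest matching word and its count in each document."""
    # Step 1: traverse all documents to collect the unique candidate words.
    tokens_by_doc = {doc_id: tokenize(text) for doc_id, text in index.items()}
    vocabulary = sorted({tok for toks in tokens_by_doc.values() for tok in toks})

    # Step 2: pick the candidate most similar to the search term.
    matches = get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    if not matches:
        return None, {}
    best = matches[0]

    # Step 3: count occurrences of the best match in each document.
    counts = {
        doc_id: toks.count(best)
        for doc_id, toks in tokens_by_doc.items()
        if best in toks
    }
    return best, counts


# A misspelled query still resolves to the closest word in the documents.
print(fuzzy_search(DOCUMENT_INDEX, "lucen"))
```

Every query rebuilds the vocabulary from scratch by reading all documents, which is the repeated cost the paragraph above describes.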

The response time to a search query depends on:

  • The data organization strategy in the database
  • The size of the data
  • The processing speed and RAM of the machine used to build the index and process the search query

Running search queries on billions of documents that have only a document-level index will be very slow (many minutes, possibly hours). Let’s look at another data organization and processing technique that helps reduce the search time.

Inverted index

An inverted index is a hashmap-like data ...
