Indexing

Understand what indexing is and its use in Distributed Search.

We will first describe what indexing is and then move on to distributing indexes over many nodes.

Indexing

Indexing is the organization and manipulation of data to facilitate fast and accurate information retrieval.

Building a searchable index

The simplest way to build a searchable index is to assign a unique ID to each document and store it in a database table, as shown in the following table. The first column holds the document ID, and the second column holds the text of that document.

This table grows with the number of documents we have, and the documents themselves can be very large. The text in the table is just an example consisting of one or two sentences, but in practice a document could be pages long. Running a search query on this document-level index is slow: on each search request, we have to traverse all the documents and count the occurrences of the search string in each one.
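The linear scan described here can be sketched in a few lines of Python. This is only a minimal illustration, assuming the index fits in an in-memory dictionary; the names `documents` and `search` are hypothetical, and the three entries are the example documents from the table.

```python
# Document-level index: document ID -> full text (the three example documents).
documents = {
    1: "Elasticsearch is the distributed, RESTful search and analytics engine "
       "at the heart of the Elastic Stack",
    2: "Elasticsearch is a search engine based on the Lucene library.",
    3: "Elasticsearch is a distributed search and analytics engine built on "
       "Apache Lucene.",
}


def search(index, term):
    """Scan every document and count case-insensitive occurrences of term."""
    needle = term.lower()
    hits = {}
    for doc_id, text in index.items():  # touches every document on every query
        count = text.lower().count(needle)
        if count > 0:
            hits[doc_id] = count
    return hits


print(search(documents, "lucene"))  # {2: 1, 3: 1}
```

Every query reads the full text of every document, so the cost of a single search grows with the total size of the collection.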

For fuzzy search, which uses approximate string matching rather than exact matching to match results against the search term, we also have to run different pattern-matching queries. Many strings in the documents will partially match the searched string. First, we must find the unique candidate strings by traversing all the documents. Then we have to determine which of these strings is the closest match. Finally, we have to count the occurrences of that closest match in each document. All of this takes a lot of time for each search query.
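The three fuzzy-search steps can be sketched as follows. This is a rough illustration rather than a production approach: it assumes the candidate strings are single lowercase words, uses Python's standard `difflib` similarity ratio as the approximate matcher, and the names `documents`, `fuzzy_search`, and `cutoff` are hypothetical.

```python
import difflib
import re

# The three example documents, keyed by document ID.
documents = {
    1: "Elasticsearch is the distributed, RESTful search and analytics engine "
       "at the heart of the Elastic Stack",
    2: "Elasticsearch is a search engine based on the Lucene library.",
    3: "Elasticsearch is a distributed search and analytics engine built on "
       "Apache Lucene.",
}


def fuzzy_search(index, term, cutoff=0.7):
    """Return the closest-matching word and its occurrence count per document."""
    # Step 1: traverse all documents to collect the unique candidate words.
    tokens = {
        doc_id: re.findall(r"[a-z0-9]+", text.lower())
        for doc_id, text in index.items()
    }
    vocabulary = sorted({word for words in tokens.values() for word in words})

    # Step 2: pick the candidate most similar to the search term.
    matches = difflib.get_close_matches(term.lower(), vocabulary, n=1, cutoff=cutoff)
    if not matches:
        return None, {}
    best = matches[0]

    # Step 3: count the occurrences of that word in each document.
    counts = {
        doc_id: words.count(best)
        for doc_id, words in tokens.items()
        if best in words
    }
    return best, counts


print(fuzzy_search(documents, "lucen"))  # ('lucene', {2: 1, 3: 1})
```

Note that the vocabulary is rebuilt from the full document text on every call, which is exactly the repeated work that makes this approach slow.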

The response time to a search query depends on:

The data organization strategy in the database
The size of the data
The processing speed and RAM of the machine used to build the index and process the search query

Running search queries on billions of documents that are document-level indexed will be very slow (many minutes to possibly hours). Let’s look at another data organization and processing technique that will help reduce the search time.


ID	Document content
1	Elasticsearch is the distributed, RESTful search and analytics engine at the heart of the Elastic Stack
2	Elasticsearch is a search engine based on the Lucene library.
3	Elasticsearch is a distributed search and analytics engine built on Apache Lucene.
