Design of Distributed Search

Let us understand the design of a distributed search system that manages a large number of queries per second.

The crawler collects the content from the intended resource. For example, if we build a search for a YouTube application, the crawler will crawl through all of the videos on YouTube and extract textual content for each video. The content could be the title of the video, description, channel name, and maybe the video’s annotation to enable an intelligent search based not only on the title and description but also on the content of that video. The crawler formats the extracted content for each video in JSON and stores these JSON documents in a distributed storage.
The indexer fetches the documents from distributed storage and index these documents using MapReduceAs stated by Wikipedia, “MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster of commodity machines.” that runs on a distributed cluster of commodity machines. The indexer uses a distributed data processing system like MapReduce for parallel and distributed index construction. The constructed index tableTerms->Mappings is stored in distributed storage.
The distributed storage is used to store the documents and the index.
The user enters the search string containing multiple words in the search bar.
The searcher parses the search string, searches for the mappings from the index stored in the distributed storage, and returns the most matched results to the user. The searcher intelligently maps the incorrectly spelled words in the search string to the closest vocabulary words. It also looks for the documents that include all the words and ranks them.

API design

Since the user only sends requests in the form of a string, the API is quite simple.

Search: The search function runs when a user queries the system by entering text in a search bar.

search(query)

The query is the text entered by the user.

Detailed discussion

Since the indexer is the core component in a search system, we discussed an indexing technique and problems with centralized indexing in the previous lesson. In this lesson, we will consider a distributed solution for indexing and searching.

Distributed indexing and searching

Let’s see how we can develop a distributed index and searching system. We understand that the input to an indexing system is the documents we created during crawling. To develop an index in a distributed fashion, we employ a large number of low-cost machines (nodes) and partition/divide the documents based on the resources they have. All the nodes are connected, and the group of nodes is called a cluster.

Our idea behind indexing using numerous small nodes is to achieve cost efficiency. This requires us to partition/split the input data (documents) among these nodes. However, a key question is, how will we perform this partitioning?

The two most common techniques used for data partitioning in distributed indexing are the following:

Document partitioning: In document partitioning, all the documents collected by the web crawler are partitioned into subsets of documents. Each node will perform indexing on a subset of documents assigned to it.
Term partitioning: The dictionary of all terms is partitioned into subsets, each subset residing at a node. For example, a subset of documents will be processed/indexed by a node containing the term “search”.

In term partitioning, a search query is sent to the nodes that correspond to the query terms. This provides for more concurrency because a stream of search queries with different query terms will be served by different nodes. But, term partitioning turns out to be a difficult task in practice. Multi-word queries necessitate sending long mapping lists between groups of nodes for merging, which can be more expensive than the benefits from the increased concurrency.

In document partitioning, each query is distributed across all nodes, and results from these nodes are merged before being shown to the user. This method of partitioning necessitates less inter-node communication. In our design, we’ll use document partitioning.

Following document partitioning, let’s look into a distributed design for index construction and querying, shown in the following illustration. We are using a cluster consisting of a number of low-cost nodes and a cluster manager. The cluster manager uses a MapReduce programming model to parallelize the index’s computation on each partition. MapReduce can work on significantly larger datasets that are difficult to be handled by a single large server.

Create a free account to access the full course.

By signing up, you agree to Educative's Terms of Service and Privacy Policy

Introduction

Abstractions

Non-functional System Characteristics

Back-of-the-Envelope Calculations

Building Blocks

Domain Name System (DNS)

Sequencer

Rate Limiter

Distributed Cache

Blob Store

Content Delivery Network (CDN)

Load Balancers

Key-Value Store

Distributed Messaging Queue

Pub-sub

Distributed Task Scheduler

Distributed Search

Distributed Logging

Distributed Monitoring

Monitoring Server Side Errors

Monitoring Client Side Errors

Databases

Sharded Counters

Concluding Building Blocks

Design YouTube

Design Quora

Design Google Maps

Designing a Proximity Server like Yelp

Design Uber

Design Twitter

Newsfeed System

Design Instagram

Design URL Shortening Service / TinyURL

Design a Web Crawler

Design WhatsApp

Design Typeahead Suggestion

Design Collaborative Document Editing Service / Google Docs

Spectacular Failures

Concluding Remarks

Appendix: System Design Interviews

All content below this will likely go away

Design Exercises

Archived temporary lessons

Design Resource Allocator for a Large Datacenter

Design Zoom

Continuous Monitoring using Data Processing

Design Live Commenting at Facebook

Security

For Noor: Placeholder for Illustration Making

Appendix

Backup of our Lessons

Caching Billions of Tiny Objects on Flash

Design Quora

Copy-Design YouTube

Identity & Access Management

Copy of CDN (02-03-2022)

Design of Distributed Search

High-level design

API design

Detailed discussion

Distributed indexing and searching