Requirements of a Distributed Search System's Design

Let's identify the requirements of a distributed search system and outline the resources we need.

Requirements

Let’s understand the functional and non-functional requirements of a distributed search system.

Functional requirements

The following is a functional requirement of a distributed search system:

  • Search: Users should get relevant content based on their search queries.
The functional requirement of a distributed search system

Non-functional requirements

Here are the non-functional requirements of a distributed search system:

  • Availability: The system should be highly available to the users.
  • Scalability: The system should have the ability to scale with the increasing amount of data. In other words, it should be able to index a large amount of data.
  • Fast search on big data: The user should get the results quickly, no matter how much content they are searching.
  • Reduced cost: The overall cost of building and running the search system should be kept low.
The non-functional requirements of a distributed search system

Resource estimation

Let’s estimate the number of servers, the storage, and the bandwidth required by the distributed search system. We’ll calculate these numbers using YouTube search as an example.

Number of servers estimation

To estimate the number of servers, we need to know the number of daily active users of the YouTube search feature. Let’s assume that 150 million daily active users use the search feature. To size the system for peak load, we treat the number of daily active users as a proxy for the number of requests per second, which gives us 150 million requests per second. Then, we use the following formula to calculate the number of servers:

Number of servers = Requests per second at peak load / RPS of a server

Using 64,000 as the estimated RPS of a server from the Back-of-the-envelope Calculations chapter, the required servers are estimated as follows:

The number of servers required for the YouTube search service

Note: Requests that arrive concurrently require far more servers than the same number of requests spread over time.
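
The server estimate can be checked with a few lines of Python. This is a minimal sketch, not a definitive sizing method: the function name `servers_needed` is hypothetical, and it assumes the number of servers is the peak requests per second divided by the RPS a single server can handle, rounded up to a whole server.

```python
def servers_needed(peak_requests_per_second: int, rps_per_server: int) -> int:
    """Return the number of servers needed to absorb the peak load.

    Assumption: servers = peak RPS / per-server RPS, rounded up,
    because a fraction of a server cannot be provisioned.
    """
    if rps_per_server <= 0:
        raise ValueError("rps_per_server must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-peak_requests_per_second // rps_per_server)


# Numbers taken from the text: 150 million requests per second at peak
# and 64,000 RPS per server.
print(servers_needed(150_000_000, 64_000))  # 2344
```

With the numbers from the text, 150,000,000 / 64,000 = 2,343.75, so roughly 2,344 servers are needed at peak load.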

Storage estimation

Each video’s metadata is stored in a separate JSON document. Each document is uniquely identified by the video ID. This metadata contains the title of the video, its description, the channel name, and a transcript. We assume the following numbers for estimating the storage required to index one video:

  • The size of a single JSON document is 200 KB.
  • The number of unique terms or keys extracted from a single JSON document is 1,000.
  • The amount of storage space required to add one term into the index table is 100 Bytes.

The following formula is used to compute the storage required to index one video:

Total storage/video = Storage/doc ...
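
The storage assumptions listed above can be combined in a short Python sketch. Because the formula is truncated here, the code assumes that the total is the size of the JSON document plus the index-table space for the terms extracted from it. That combination, the function name `storage_per_video_bytes`, and the convention 1 KB = 1,000 bytes are assumptions made for illustration only.

```python
KB = 1_000  # assumption: decimal kilobytes, as is common in rough estimates


def storage_per_video_bytes(doc_size_bytes: int,
                            terms_per_doc: int,
                            bytes_per_term: int) -> int:
    """Estimate the bytes needed to index one video.

    Assumption: total = document size + (terms per document * bytes per term).
    """
    return doc_size_bytes + terms_per_doc * bytes_per_term


# Numbers from the text: a 200 KB document, 1,000 unique terms,
# and 100 bytes per term in the index table.
total = storage_per_video_bytes(200 * KB, 1_000, 100)
print(total // KB, "KB")  # 300 KB
```

Under these assumptions, indexing one video takes about 300 KB: 200 KB for the document and 100 KB for its 1,000 index entries.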
