Requirements of a Distributed Search System's Design
Let's identify the requirements of a distributed search system and outline the resources we need.
Requirements
Let’s understand the functional and non-functional requirements of a distributed search system.
Functional requirements
The following is a functional requirement of a distributed search system:
- Search: Users should get relevant content based on their search queries.
Non-functional requirements
Here are the non-functional requirements of a distributed search system:
- Availability: The system should be highly available to the users.
- Scalability: The system should have the ability to scale with the increasing amount of data. In other words, it should be able to index a large amount of data.
- Fast search on big data: The user should get the results quickly, no matter how much content they are searching.
- Reduced cost: The overall cost of building the search system should be kept low.
Resource estimation
Let’s estimate the number of servers, the storage, and the bandwidth required by the distributed search system. We’ll calculate these numbers using YouTube search as an example.
Number of servers estimation
To estimate the number of servers, we need to know the number of daily active users of the YouTube search feature. Let’s assume that 150 million daily active users use this feature. To size the system for peak load, we use the number of daily active users as a proxy for the number of requests per second, which gives us 150 million requests per second. Then, we use the following formula to calculate the number of servers:
Number of servers = Number of requests per second / RPS of a server
Using 64,000 as an estimated RPS of a server from the Back-of-the-envelope Calculations chapter, the required servers are estimated as follows:
Number of servers = 150 million / 64,000 ≈ 2,344 servers
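As a minimal sketch of this estimate, the following Python snippet divides the assumed peak request rate by the per-server RPS and rounds up to a whole server. The function name `servers_needed` and the constant names are hypothetical, chosen only for illustration. The inputs are the assumptions stated in the text (150 million requests per second at peak, 64,000 RPS per server).

```python
import math


def servers_needed(peak_requests_per_second: int, rps_per_server: int) -> int:
    """Return the number of servers needed to absorb the peak request rate.

    Rounds up, because a fraction of a server still requires a whole machine.
    """
    return math.ceil(peak_requests_per_second / rps_per_server)


# Assumptions taken from the text: daily active users are used as a
# proxy for peak requests per second, and one server handles 64,000 RPS.
PEAK_REQUESTS_PER_SECOND = 150_000_000
RPS_PER_SERVER = 64_000

estimated_servers = servers_needed(PEAK_REQUESTS_PER_SECOND, RPS_PER_SERVER)
print(estimated_servers)
```

Changing either constant shows how sensitive the server count is to the peak-load assumption.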
Note: Concurrent requests significantly impact the number of required servers compared to requests spread over time.
Storage estimation
Each video’s metadata is stored in a separate JSON document. Each document is uniquely identified by the video ID. This metadata contains the title of the video, its description, the channel name, and a transcript. We assume the following numbers for estimating the storage required to index one video:
- The size of a single JSON document is 200 KB.
- The number of unique terms or keys extracted from a single JSON document is 1,000.
- The amount of storage space required to add one term to the index table is 100 bytes.
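One plausible way to combine the three numbers above is sketched below. It assumes that the storage for one video is the JSON document itself plus one index-table entry per unique term, and that 1 KB means 1,000 bytes. Both are assumptions made for illustration, and the function and constant names are hypothetical.

```python
# Assumed inputs from the list above (1 KB taken as 1,000 bytes).
DOCUMENT_SIZE_BYTES = 200 * 1_000   # size of one JSON document
TERMS_PER_DOCUMENT = 1_000          # unique terms extracted per document
BYTES_PER_INDEX_TERM = 100          # index-table cost of one term


def storage_per_video_bytes(document_bytes: int, terms: int, bytes_per_term: int) -> int:
    """Assumed model: document storage plus the index entries for its terms."""
    return document_bytes + terms * bytes_per_term


per_video_bytes = storage_per_video_bytes(
    DOCUMENT_SIZE_BYTES, TERMS_PER_DOCUMENT, BYTES_PER_INDEX_TERM
)
print(per_video_bytes / 1_000, "KB per video")
```

Under these assumptions, the index entries add half as much again to the size of the document itself.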
The following formula is used to compute the storage required to index one video:
...