Introduction to Distributed Search
Get an introduction to a search system and identify the requirements.
Why do we need a search system?
Today, almost every website has a search bar. We use it to find relevant content among the huge amount of content on that website, and it lets us quickly find what we are looking for. For example, Educative offers plenty of courses. Without a search feature, users would have to scroll through many pages and read the name of each course to find a particular one.
Let’s take another example. Billions of videos are uploaded and stored on YouTube. If YouTube didn’t provide a search bar, how would we find a specific video among so many? Navigating through all the videos would take months. Users find it challenging to locate what they’re looking for simply by scrolling around.
Search engines are an even bigger example. We have billions of websites on the Internet, each website has many web pages, and on each web page, there is plenty of content. With this much content, the Internet would be practically useless without search engines, and users would end up in a sea of irrelevant data. Search engines are essentially filters for the massive amount of data available on the Internet. They let users obtain information that is of true interest or worth quickly and simply, without having to sift through a large number of unnecessary web pages.
Behind every search bar, there is a search system.
What is a search system?
A search system takes text input (a search query) from the user and returns the relevant content in a few seconds or less. A search system has three main components: a crawler for fetching content and creating documents, an indexer for building a searchable index, and a searcher for answering queries against that index.
The documents are stored on distributed storage like S3 or HDFS.
We have a separate chapter on the crawler. That’s why we will focus more on indexing in this chapter.
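To make the indexing and searching steps concrete, here is a minimal single-machine sketch. It assumes an inverted index (a map from each term to the documents that contain it), which is a common choice but is not stated in the text above. The sample documents and the function names `tokenize`, `build_index`, and `search` are hypothetical and chosen only for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(documents):
    """Indexer: map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Searcher: return sorted IDs of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    # Copy the first posting set so the index itself is never mutated.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)


# Hypothetical sample data standing in for documents produced by a crawler.
documents = {
    1: "Introduction to distributed search",
    2: "Designing a distributed cache",
    3: "Search engines and web crawlers",
}

index = build_index(documents)
print(search(index, "distributed search"))  # [1]
print(search(index, "search"))              # [1, 3]
```

A real distributed search system would partition both the documents and the index across many machines. This sketch only shows how an index lets a query avoid scanning every document.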
We need a search system that meets the following functional and non-functional requirements.
Functional requirements
The following is a functional requirement of a search system.
- Search: Users should get the relevant content based on the search query.
Non-functional requirements
The following are the non-functional requirements of a search system.
- Availability: The system should be highly available to the users.
- Scalability: The system should have the ability to scale with the increasing amount of data. In other words, it should be able to index a large amount of data.
- Fast search on big data: The user should get the results quickly, no matter how much content is being searched.
- Reduced cost: The overall cost of building the search system should be low.
Estimations
Let’s estimate the number of servers, the storage, and the bandwidth required by the system. We will calculate these numbers using YouTube search as an example.
Servers
To estimate the number of servers, we need to know how many daily active users use the search feature on YouTube and how many requests per second a single server can handle. We assume the following numbers:
- Daily active users who use the search feature: 3 Million
- Number of requests a single server can handle per second: 1K
The number of servers required is calculated using the formula below:
...
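As a rough illustration of this kind of estimate, the sketch below works through the arithmetic under two explicitly assumed load models. It is not necessarily the formula used in the course. The function `servers_needed` and the figure of 10 searches per user per day are hypothetical. Only the 3 million daily active users and the 1K requests per server come from the text.

```python
import math


def servers_needed(requests_per_second, requests_per_server_per_second):
    """Round up so the fleet can absorb the full request rate."""
    return math.ceil(requests_per_second / requests_per_server_per_second)


DAILY_ACTIVE_USERS = 3_000_000   # from the text
REQUESTS_PER_SERVER = 1_000      # from the text, read as requests per second

# Assumption (not from the text): worst case, every daily active user
# sends one search request in the same second.
peak_servers = servers_needed(DAILY_ACTIVE_USERS, REQUESTS_PER_SERVER)
print(peak_servers)  # 3000

# Assumption (hypothetical figure): each user runs 10 searches per day,
# spread evenly over the 86,400 seconds in a day.
SEARCHES_PER_USER_PER_DAY = 10
average_rps = DAILY_ACTIVE_USERS * SEARCHES_PER_USER_PER_DAY / 86_400
average_servers = servers_needed(average_rps, REQUESTS_PER_SERVER)
print(average_servers)  # 1
```

The gap between the two results shows how strongly the server count depends on the assumed load model. Sizing for the worst case is conservative, while sizing for the average leaves no headroom for traffic spikes.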