Introduction to Web Crawler
Understand the requirements to design a web crawler.
Introduction
A web crawler is an Internet bot that systematically scours the World Wide Web, fetching and storing the content of the web pages it visits.
Crawling is the first step performed by search engines; the stored data is used for indexing and ranking purposes. This design problem is limited to the web crawler itself and does not include explanations of the search engine’s later stages of indexing and ranking.
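To make the crawling loop concrete, here is a minimal, single-threaded sketch in Python. It is an illustration under simplifying assumptions, not the chapter’s actual design: the `AnchorParser` and `crawl` names, the `max_pages` cap, and the in-memory dictionary standing in for the blob store are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.found.append(urljoin(self.base_url, value))


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a queue of seed URLs.

    Returns a dict mapping each visited URL to its raw HTML; the dict is
    only a stand-in for the blob store discussed later in this chapter.
    """
    frontier = deque(seed_urls)          # URL frontier, seeded by the administrator
    visited, store = set(), {}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable or non-HTML pages
        visited.add(url)
        store[url] = html                # "storing" step: keep content for indexing
        parser = AnchorParser(url)
        parser.feed(html)
        frontier.extend(link for link in parser.found if link not in visited)
    return store
```

A production crawler would add politeness delays, URL deduplication, and many distributed workers, in line with the non-functional requirements discussed below.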
Additional utilities of a web crawler are as follows:
- Web page testing: Web crawlers are a way to check the validity of web pages’ links and structure (a minimal link-check sketch follows this list).
- Web page monitoring: We use web crawlers to monitor content or structure updates on web pages.
- Site mirroring: Web crawlers are an effective way to mirror popular websites. Mirroring is like making a dynamic carbon copy of a website. It applies to network services available over any protocol, such as HTTP or FTP. The URLs of the mirror sites differ from the originals, but the content is similar or almost identical.
- Copyright infringement check: Web crawlers fetch content and check it for copyright infringement issues.
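As a small illustration of the web page testing utility, the sketch below (a hypothetical helper, not part of this design) takes links already extracted from a page and reports which ones respond; errors indicate broken links.

```python
from urllib.request import Request, urlopen


def check_links(urls, timeout=5):
    """Report the HTTP status of each URL; errors indicate broken links."""
    results = {}
    for url in urls:
        try:
            # A HEAD request avoids downloading the body when only the status matters.
            with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
                results[url] = response.status
        except Exception as exc:          # 4xx/5xx responses and network failures land here
            results[url] = f"error: {exc}"
    return results
```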
This chapter will design a web crawler and evaluate how it fulfills the functional and non-functional requirements.
The output of the crawling process is the data that serves as input to the subsequent processing phases: data cleaning, indexing, computing page relevance using algorithms like PageRank, and analytics. For some of these subsequent stages, see our chapter on distributed search.
Requirements and goals
Let’s highlight the functional and non-functional requirements for a web crawler.
Functional
Below are the functionalities the system should provide:
- Crawling: The system should scour the World Wide Web, starting from a queue of seed URLs provided initially by the system administrator.
Food for thought!
From where do we get these seed URLs?
- Storing: The system should be able to extract and store the content of a URL in a blob store, making that URL, along with its content, processable by the search engine for indexing and ranking purposes.
- Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to keep its blob store’s records up to date, as sketched below.
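A minimal way to realize the scheduling requirement is a priority queue keyed by the time each URL is next due for a recrawl. The class below is an illustrative sketch; the daily interval and all names are assumptions, and a real system might instead derive each page’s recrawl frequency from how often it changes.

```python
import heapq
import time


class RecrawlScheduler:
    """Keeps URLs in a min-heap ordered by the time they are next due to be crawled."""

    def __init__(self, recrawl_interval_sec=24 * 3600):   # assumed daily recrawl interval
        self.interval = recrawl_interval_sec
        self.heap = []                                     # entries are (due_time, url)

    def add(self, url, due_time=None):
        """Register a URL, due immediately unless a later time is given."""
        heapq.heappush(self.heap, (due_time if due_time is not None else time.time(), url))

    def pop_due(self, now=None):
        """Return all URLs whose recrawl time has passed, and reschedule each of them."""
        now = time.time() if now is None else now
        due = []
        while self.heap and self.heap[0][0] <= now:
            _, url = heapq.heappop(self.heap)
            due.append(url)
            heapq.heappush(self.heap, (now + self.interval, url))   # next scheduled pass
        return due
```

A crawling worker would periodically call `pop_due()` and push the returned URLs onto the crawl frontier.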
Non-functional
- Scalability: The system should inherently be distributed and multithreaded because it has to fetch hundreds of millions of web documents.
- Extensibility: Currently, our design supports the HTTP(S) communication protocol and the storage of text files. For augmented functionality, it should also be extensible to other network communication protocols and allow adding modules that process and store various file formats.
- Consistency: Since our system involves multiple crawling workers, data consistency among all of them is required.
- Performance: The system should be smart enough to limit its crawling of a domain, either by the time spent or by the count of URLs visited on that domain; this is called self-throttling (sketched after this list). The number of URLs crawled per second and the throughput of the crawled content should be optimal.
- Improved user interface (customized scheduling): Besides the default recrawling, which is a functional requirement, the system should also support non-routine, customized crawling on the system administrator’s demand.
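The self-throttling mentioned in the performance requirement can be sketched as a per-domain budget tracker: before each fetch, a worker asks whether the domain still has URL budget, time budget, and enough delay since its last request. The limits and names below are illustrative assumptions, not values from the design.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


class DomainThrottle:
    """Self-throttling: caps the crawl of any single domain by URL count and elapsed time."""

    def __init__(self, max_urls=500, max_seconds=300, min_delay=1.0):
        self.max_urls = max_urls            # budget of URLs per domain (assumed value)
        self.max_seconds = max_seconds      # budget of time per domain (assumed value)
        self.min_delay = min_delay          # politeness gap between requests to one domain
        self.url_count = defaultdict(int)
        self.first_fetch = {}
        self.last_fetch = {}

    def allow(self, url, now=None):
        """Return True if the crawler may fetch this URL right now."""
        now = time.time() if now is None else now
        domain = urlparse(url).netloc
        self.first_fetch.setdefault(domain, now)
        if self.url_count[domain] >= self.max_urls:
            return False                    # URL budget for this domain exhausted
        if now - self.first_fetch[domain] > self.max_seconds:
            return False                    # time budget for this domain exhausted
        if now - self.last_fetch.get(domain, 0.0) < self.min_delay:
            return False                    # too soon since the previous request
        self.url_count[domain] += 1
        self.last_fetch[domain] = now
        return True
```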
Estimations
We need to estimate various resource requirements for our design.
Assumptions: Following are the assumptions for our requirements’ estimations:
- There are a total of 5 billion web pages.
- The text content per web page is 2070 KB. A study suggests that the average size of a web page’s content is 2070 KB (2.07 MB), based on 892 processed websites.
- The metadata for one web page is 500 Bytes. It consists of the web page title and a description of the web page showing its purpose.
Storage requirements
The collective storage required to store the textual content of 5 billion web pages is: ...
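A back-of-the-envelope sketch of this calculation, using the assumptions above (the course’s exact presentation may differ):

```latex
\begin{aligned}
\text{Text content} &\approx 5 \times 10^{9}\ \text{pages} \times 2.07\ \text{MB/page} = 10.35\ \text{PB}\\
\text{Metadata} &\approx 5 \times 10^{9}\ \text{pages} \times 500\ \text{B/page} = 2.5\ \text{TB}
\end{aligned}
```

The metadata is negligible next to the text content, so a single full crawl needs on the order of 10 PB of blob storage.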