Requirements of a Web Crawler's Design

Learn about the design requirements of a web crawler.

Requirements

Let’s highlight the functional and non-functional requirements of a web crawler.

Functional requirements

These are the functionalities the system must provide:

  • Crawling: The system should scour the World Wide Web (WWW), starting from a queue of seed URLs provided initially by the system administrator.

Points to Ponder

Where do we get these seed URLs from?

  • Storing: The system should be able to extract and store the content of a URL in a blob store. This makes that URL and its content available to search engines for indexing and ranking purposes.

  • Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to keep the records in its blob store up to date (a minimal crawl-and-store loop is sketched after this list).
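
To make these requirements concrete, here is a minimal, single-threaded sketch of the crawl-and-store loop, using only the Python standard library. The `blob_store` object and its `put` method are hypothetical placeholders for whatever blob storage the design ends up using; a production crawler would distribute this loop across many workers.

```python
import time
import urllib.request
from collections import deque
from urllib.parse import urljoin
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, blob_store, max_pages=100):
    """Fetch pages starting from seed URLs, store their content, and enqueue new links."""
    frontier = deque(seed_urls)   # URL frontier seeded by the system administrator
    visited = set()

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)

        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                content = response.read()
        except Exception:
            continue  # skip unreachable or malformed URLs

        # Storing: persist the raw content so indexing/ranking can process it later.
        blob_store.put(url, content, crawled_at=time.time())

        # Crawling: extract outgoing links and add them to the frontier.
        parser = LinkExtractor()
        parser.feed(content.decode("utf-8", errors="ignore"))
        for link in parser.links:
            frontier.append(urljoin(url, link))
```

In a real deployment, the scheduling requirement would be met by periodically re-adding previously stored URLs to the frontier when their records are due for a refresh.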

Non-functional requirements

  • Scalability: The system should inherently be distributed and multithreaded, because it has to fetch hundreds of millions of web documents.

  • Extensibility: Currently, our design supports the HTTP(S) communication protocol and the storage of text files. To augment its functionality, the design should also be extensible, so that it can support other network communication protocols and add modules that process and store other file formats.

  • Consistency: Since our system involves multiple crawling workers, having data consistency among all of them is necessary.

    In general, data consistency means the reliability and accuracy of data across a system or dataset. In the web crawler’s context, it means that all workers adhere to the same set of crawling rules so that the data they produce is consistent.

  • Performance: The system should be smart enough to limit its crawling of a domain, either by the time spent or by the number of URLs visited on that domain. This process is called self-throttling (a per-domain throttle is sketched after this list). The number of URLs crawled per second and the throughput of the crawled content should be optimal.

  • Improved user interface (customized scheduling): Besides the default recrawling, which is a functional requirement, the system should also let the system administrator schedule non-routine, customized crawls on demand.
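
The self-throttling behavior described under the performance requirement can be sketched as a small per-domain rate limiter. The one-second delay and the per-domain URL cap below are illustrative values, not figures prescribed by the design.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Limits how aggressively a worker hits any single domain (self-throttling)."""

    def __init__(self, min_delay_seconds=1.0, max_urls_per_domain=1000):
        self.min_delay = min_delay_seconds    # minimum gap between hits to the same domain
        self.max_urls = max_urls_per_domain   # cap on visited URLs per domain
        self.last_hit = {}                    # domain -> timestamp of last request
        self.visit_count = {}                 # domain -> number of URLs crawled

    def allow(self, url):
        """Return True if the URL may be fetched now; otherwise the caller should requeue it."""
        domain = urlparse(url).netloc
        now = time.monotonic()

        if self.visit_count.get(domain, 0) >= self.max_urls:
            return False  # domain budget exhausted for this crawl

        if now - self.last_hit.get(domain, 0.0) < self.min_delay:
            return False  # too soon; respect the per-domain delay

        self.last_hit[domain] = now
        self.visit_count[domain] = self.visit_count.get(domain, 0) + 1
        return True
```

A crawl worker would call `allow(url)` before every fetch and push the URL back onto the frontier whenever it returns `False`.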

Resource estimation

We need to estimate various resource requirements for our design.

Assumptions

These are the assumptions we’ll use when estimating our resource requirements:

  • There are a total of 5 billion web pages.
  • The text content per web page is 2070 KB. (A study suggests that the average size of a web page’s content is 2070 KB (2.07 MB), based on 892 processed websites.)
  • The metadata for one web page is 500 bytes. (Metadata consists of the web page’s title and a description indicating its purpose.)

Storage estimation

The collective storage required to store the textual content of 5 billion web pages is:

Total\ storage\ per\ crawl = 5\ billion \times (2070\ KB + 500\ B) \approx 10.35\ PB
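
For reference, the same back-of-the-envelope arithmetic can be reproduced in a few lines of Python; the constants match the assumptions above, and the conversion uses decimal units (1 PB = 10^15 bytes).

```python
# Back-of-the-envelope storage estimate for one full crawl.
PAGES = 5_000_000_000            # 5 billion web pages
CONTENT_BYTES = 2_070 * 1_000    # 2070 KB of text content per page
METADATA_BYTES = 500             # 500 bytes of metadata per page

total_bytes = PAGES * (CONTENT_BYTES + METADATA_BYTES)
total_petabytes = total_bytes / 1_000**5

print(f"Total storage per crawl ≈ {total_petabytes:.2f} PB")  # ≈ 10.35 PB
```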
