Introduction to Web Crawler

Understand the requirements to design a web crawler.

Introduction

A web crawler is an Internet bot that systematically scours (moves swiftly through in search of something) the World Wide Web for content, starting its operation from a pool of seed URLs (stored URLs that serve as the crawler's starting point). This process of acquiring content from the WWW is called crawling. The crawler then saves the crawled content in data stores; this process of efficiently saving data for subsequent use is called storing.
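
To make the crawling and storing steps concrete, here is a minimal, single-threaded sketch in Python. The in-memory `store` dictionary, the `max_pages` limit, and the breadth-first strategy are illustrative assumptions, not the distributed design developed in this chapter.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags on a fetched page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, max_pages=10):
    """Breadth-first crawl starting from the seed URLs.

    Returns a dict mapping each visited URL to its raw content; the
    dict stands in for the blob store used in the storing step.
    """
    frontier = deque(seed_urls)  # URLs waiting to be crawled
    visited = set()              # URLs already fetched (deduplication)
    store = {}                   # placeholder for the blob store

    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=5) as response:
                content = response.read()
        except (OSError, ValueError):
            continue  # skip unreachable or malformed URLs

        store[url] = content  # the "storing" step

        # Extract out-links and add unseen ones to the crawl frontier.
        parser = LinkExtractor()
        parser.feed(content.decode("utf-8", errors="ignore"))
        for link in parser.links:
            absolute = urljoin(url, link)
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)
    return store
```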

It’s the first step performed by search engines; the stored data is used for indexing and ranking purposes. This specific design problem is limited to web crawlers and does not include explanations of search engines’ later stages of indexing and ranking.

Additional utilities of a web crawler are as follows:

  • Web page testing: Web crawlers are used to check the validity of web pages’ links and structure.

  • Web page monitoring: We use web crawlers to monitor content or structure updates on web pages (a change-detection sketch follows this list).

  • Site mirroring: Web crawlers are an effective way to mirror popular websites. Mirroring is like making a dynamic carbon copy of a website: the mirror is served over a network protocol such as HTTP or FTP at URLs that differ from the original site’s, while its content remains similar or almost identical.

  • Copyright infringement check: Web crawlers fetch content and check it for copyright infringement issues.
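
To illustrate the monitoring use case above, a crawler can fingerprint each fetched page and compare the fingerprint on a later visit. The helpers below are a hypothetical sketch, not part of the design itself.

```python
import hashlib


def page_fingerprint(content: bytes) -> str:
    """Return a stable hash of a page's content for change detection."""
    return hashlib.sha256(content).hexdigest()


def has_changed(old_fingerprint: str, new_content: bytes) -> bool:
    """Compare a stored fingerprint against freshly crawled content."""
    return page_fingerprint(new_content) != old_fingerprint
```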

In this chapter, we’ll design a web crawler and evaluate how it fulfills the functional and non-functional requirements.

The output of the crawling process is the input to the subsequent processing phases (data cleaning, indexing, page relevance computation using algorithms like PageRank, and analytics). For some of these subsequent stages, see our chapter on distributed search.

Requirements and goals

Let’s highlight the functional and non-functional requirements for a web crawler.

Functional

Below are the functionalities a user should be able to perform:

  • Crawling: The system should scour the World Wide Web, starting from a queue of seed URLs provided initially by the system administrator.

Food for thought: From where do we get these seed URLs?

  • Storing: The system should be able to extract and store the content of a URL in a blob store, making that URL, along with its content, processable by the search engines for indexing and ranking purposes.

  • Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to update the records in its blob store (a scheduling sketch follows this list).
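
One way to picture the scheduling requirement is a priority queue keyed by each URL's next due time. The sketch below assumes a fixed one-day recrawl interval and an in-memory heap; both are simplifications of the distributed design discussed later.

```python
import heapq
import time
from dataclasses import dataclass, field


@dataclass(order=True)
class CrawlTask:
    """A URL plus the time at which it should next be (re)crawled."""
    next_crawl_at: float
    url: str = field(compare=False)


class RecrawlScheduler:
    """Min-heap of crawl tasks ordered by their due time."""

    def __init__(self, seed_urls, recrawl_interval=24 * 3600):
        # The one-day default interval is an assumption; a real system
        # would tune it per URL based on how often the page changes.
        self.recrawl_interval = recrawl_interval
        self.heap = [CrawlTask(0.0, url) for url in seed_urls]
        heapq.heapify(self.heap)

    def next_due(self):
        """Pop the next URL whose crawl time has arrived, or None."""
        if self.heap and self.heap[0].next_crawl_at <= time.time():
            return heapq.heappop(self.heap).url
        return None

    def reschedule(self, url):
        """Queue the URL again after it has been crawled and stored."""
        heapq.heappush(
            self.heap, CrawlTask(time.time() + self.recrawl_interval, url)
        )
```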

Non-functional

  • Scalability: The system should inherently be distributed and multithreaded as it has to fetch hundreds of millions of web documents.

  • Extensibility: Currently, our design supports the HTTP(S) communication protocol and the storage of text files. For augmented functionality, it should also be extensible to other network communication protocols and allow additional modules to process and store other file formats.

  • Consistency: Since our system will involve multiple crawling workers, having data consistency among all of them is required.

  • Performance: The system should be smart enough to limit its crawling at a domain, either by the time spent or by the count of URLs visited on that domain; this is self-throttling (see the throttling sketch after this list). The number of URLs crawled per second and the throughput of the crawled content should be optimal.

  • Improved user interface (customized scheduling): Besides the default recrawling, which is a functional requirement, the system should also support non-routine, customized crawling on the system administrator’s demand.
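
The self-throttling behavior in the performance requirement can be pictured as a per-domain politeness check. The one-second delay below is an assumed example value, not a figure from the design.

```python
import time
from urllib.parse import urlparse


class DomainThrottle:
    """Tracks the last fetch time per domain and enforces a minimum delay."""

    def __init__(self, min_delay_seconds=1.0):
        self.min_delay = min_delay_seconds  # assumed politeness delay
        self.last_fetch = {}                # domain -> last fetch timestamp

    def can_fetch(self, url: str) -> bool:
        """Return True if enough time has passed since this domain was hit."""
        domain = urlparse(url).netloc
        now = time.time()
        if now - self.last_fetch.get(domain, 0.0) >= self.min_delay:
            self.last_fetch[domain] = now
            return True
        return False
```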

Estimations

We need to estimate various resource requirements for our design.

Assumptions: Following are the assumptions for our requirements’ estimations:

  • There are a total of 5 billion web pages.
  • The text content per web page is 2070 KB (2.07 MB); a study of 892 processed websites suggests this is the average size of a web page’s content.
  • The metadata for one web page, consisting of its title and a description of its purpose, is 500 Bytes.

Storage requirements

The collective storage required to store the textual content of 5 billion web pages is: ...
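
Working only from the assumptions listed above, a back-of-the-envelope version of this calculation is:

```latex
\text{Text storage} \approx 5 \times 10^{9}\ \text{pages} \times 2070\ \text{KB/page}
                    \approx 1.035 \times 10^{16}\ \text{B} \approx 10.35\ \text{PB}

\text{Metadata storage} \approx 5 \times 10^{9}\ \text{pages} \times 500\ \text{B/page}
                        = 2.5 \times 10^{12}\ \text{B} = 2.5\ \text{TB}
```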
