Design a Web Crawler

Learn to design a web crawler.

Introduction

A web crawler is an Internet bot that systematically scours the World Wide Web (WWW) for content, starting its operation from a pool of seed URLs (stored URLs that serve as a starting point for the crawler). This process of acquiring content from the WWW is called crawling. The crawler then saves the crawled content in data stores. The process of efficiently saving data for subsequent use is called storing.

This is the first step that’s performed by search engines; the stored data is used for indexing and ranking purposes. This specific design problem is limited to web crawlers and doesn’t include explanations of the later stages of indexing and ranking in search engines.
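To make crawling and storing concrete, here is a minimal, single-threaded sketch in Python. The blob_store argument and its put(key, value) method are assumptions standing in for the blob store introduced later; a production crawler would also add politeness rules, robots.txt handling, deduplication, and many parallel workers.

```python
# A minimal sketch of crawling and storing, assuming a hypothetical
# blob_store object with a put(key, value) method.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href attributes from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_urls, blob_store, max_pages=100):
    frontier = deque(seed_urls)          # URLs waiting to be crawled
    visited = set()                      # URLs that have already been fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=5) as response:
                content = response.read()
        except (OSError, ValueError):
            continue                     # skip unreachable or malformed URLs
        blob_store.put(url, content)     # "storing": save the crawled content
        parser = LinkExtractor()
        parser.feed(content.decode("utf-8", errors="ignore"))
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).scheme in ("http", "https"):
                frontier.append(absolute)  # "crawling": follow newly found URLs
```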

Requirements

Functional requirements

  • Crawling: The system should scour the WWW, starting from a queue of seed URLs provided initially by the system administrator.

  • Storing: The system should be able to extract and store the content of a URL in a blob store. This makes that URL and its content processable by the search engines for indexing and ranking purposes.

  • Scheduling: Since crawling is a repeated process, the system should schedule recurring crawls to keep the blob store’s records up to date (a minimal recrawl loop is sketched after this list).
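As a rough illustration of the scheduling requirement, the sketch below re-enqueues each URL after a per-URL recrawl interval. The crawl_url callback and the interval values are assumptions for illustration; a real scheduler would persist this state and drive many crawl workers concurrently.

```python
# A minimal sketch of recurring crawl scheduling, assuming a hypothetical
# crawl_url(url) function that fetches and stores a single URL.
import heapq
import time


def run_scheduler(urls_with_intervals, crawl_url, iterations=10):
    """urls_with_intervals: iterable of (url, recrawl_interval_seconds) pairs."""
    # Min-heap ordered by the next time each URL is due to be crawled.
    due = [(time.time(), url, interval) for url, interval in urls_with_intervals]
    heapq.heapify(due)
    for _ in range(iterations):
        next_time, url, interval = heapq.heappop(due)
        delay = next_time - time.time()
        if delay > 0:
            time.sleep(delay)            # wait until the URL is due
        crawl_url(url)                   # fetch and store the URL
        # Re-enqueue the URL for its next scheduled crawl.
        heapq.heappush(due, (time.time() + interval, url, interval))
```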

Non-functional requirements

  • Scalability: The system should inherently be distributed and multithreaded because it has to fetch hundreds of millions of web documents.

  • Extensibility: Currently, our design supports the HTTP(S) communication protocol and the storage of text files. For augmented functionality, it should also be extensible to other network communication protocols and allow the addition of modules that process and store other file formats.

  • Consistency: Since our system involves multiple crawling workers, having data consistency among all of them is necessary.

  • Performance: The system should be smart enough to limit its crawling of a domain, either by the time spent or by the count of URLs visited on that domain. This process is called self-throttling (a minimal throttling sketch follows this list). The number of URLs crawled per second and the throughput of the crawled content should be optimal.

  • Improved user interface (customized scheduling): Besides the default recrawling, which is a functional requirement, the system should also support nonroutine, customized crawling on the system administrator’s demand.
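To illustrate self-throttling, the sketch below tracks a per-domain page budget and a minimum delay between requests to the same domain. The specific limits are illustrative assumptions, not tuned values.

```python
# A minimal sketch of per-domain self-throttling; the numbers below are
# placeholders, not recommendations.
import time
from urllib.parse import urlparse


class DomainThrottle:
    def __init__(self, max_pages_per_domain=500, min_delay_seconds=1.0):
        self.max_pages = max_pages_per_domain
        self.min_delay = min_delay_seconds
        self.pages_crawled = {}          # domain -> number of URLs fetched so far
        self.last_fetch_time = {}        # domain -> timestamp of the last fetch

    def acquire(self, url):
        """Return True if the URL may be fetched now, sleeping briefly
        if needed to respect the per-domain delay."""
        domain = urlparse(url).netloc
        if self.pages_crawled.get(domain, 0) >= self.max_pages:
            return False                 # budget for this domain is exhausted
        elapsed = time.time() - self.last_fetch_time.get(domain, 0.0)
        if elapsed < self.min_delay:
            time.sleep(self.min_delay - elapsed)
        self.pages_crawled[domain] = self.pages_crawled.get(domain, 0) + 1
        self.last_fetch_time[domain] = time.time()
        return True
```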

Building blocks we will use

Here is the list of the main building blocks we’ll use in our design (a rough sketch of their interfaces follows the list):

  • A scheduler is used to schedule crawling events on the URLs that are stored in its database.

  • A DNS is needed to resolve the web pages’ domain names to IP addresses.

  • A cache is utilized for storing fetched documents for quick access by all the processing modules.

  • The blob store’s main application is to store the crawled content.
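The sketch below shows how these building blocks might look from the crawler’s point of view. The class and method names are assumptions chosen for illustration, not the specified interfaces of these components; the scheduler itself is detailed in the Design section that follows.

```python
# Illustrative, in-memory stand-ins for the DNS, cache, and blob store
# building blocks; names and methods are assumptions for this sketch.
import socket


class Dns:
    """Resolves host names to IP addresses using the system resolver."""
    def resolve(self, hostname):
        return socket.gethostbyname(hostname)


class Cache:
    """Stand-in for the shared cache of fetched documents."""
    def __init__(self):
        self._docs = {}

    def get(self, url):
        return self._docs.get(url)

    def put(self, url, document):
        self._docs[url] = document


class BlobStore:
    """Stand-in for the blob store that keeps crawled content."""
    def __init__(self):
        self._blobs = {}

    def put(self, url, content):
        self._blobs[url] = content
```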

Design

Below, we will describe the building blocks and the additional components involved in the design and workflow of the web crawling process with respect to its requirements.

Components

Here are the details of the building blocks and the components needed for our design:

  • Scheduler: This is one of the key building blocks that schedules URLs for crawling. It’s composed of two units: a priority queue and a relational database.

    • A priority queue (URL frontier): The queue hosts URLs that are made ready for crawling based on two properties associated with each entry: priority (the precedence a URL gets in the URL frontier, assigned according to its content) and update frequency (the recrawl frequency defined for each URL, which determines how often it is placed back in the URL frontier). A minimal sketch of such a frontier follows below.

    • Relational database: ...
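As a rough sketch of the URL frontier, the snippet below models each entry with a priority and a recrawl interval and pops the highest-priority URL first. The entry fields and class names are assumptions for illustration; heapq pops the smallest value, so a lower number means a higher priority here.

```python
# A minimal sketch of the URL frontier as a priority queue.
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class FrontierEntry:
    priority: int                                   # precedence in the frontier (lower = sooner)
    url: str = field(compare=False)
    recrawl_interval_seconds: int = field(compare=False)


class UrlFrontier:
    def __init__(self):
        self._heap = []

    def push(self, entry):
        heapq.heappush(self._heap, entry)

    def pop(self):
        # Returns the highest-priority URL that is ready for crawling.
        return heapq.heappop(self._heap)


# Example usage with illustrative URLs and intervals:
frontier = UrlFrontier()
frontier.push(FrontierEntry(priority=2, url="https://example.com/blog",
                            recrawl_interval_seconds=86400))
frontier.push(FrontierEntry(priority=1, url="https://example.com/news",
                            recrawl_interval_seconds=3600))
print(frontier.pop().url)                           # the priority-1 news URL comes out first
```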