Introduction to Web Crawler
Understand the requirements to design a web crawler.
Introduction
A web crawler is an Internet bot that systematically scours the World Wide Web, fetching and storing the content of the web pages it visits.
Crawling is the first step performed by search engines; the stored data is used for indexing and ranking purposes. This design problem is limited to the web crawler itself and does not include explanations of the search engine’s later stages of indexing and ranking.
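To make the crawling loop concrete, here is a minimal, single-threaded sketch in Python. It is an illustration under simplifying assumptions, not the chapter’s actual design: the `AnchorParser` and `crawl` names, the `max_pages` cap, and the in-memory dictionary standing in for the blob store are all hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class AnchorParser(HTMLParser):
    """Collects absolute URLs from <a href="..."> tags on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.found.append(urljoin(self.base_url, value))


def crawl(seed_urls, max_pages=100):
    """Breadth-first crawl starting from a queue of seed URLs.

    Returns a dict mapping each visited URL to its raw HTML; the dict is
    only a stand-in for the blob store discussed later in this chapter.
    """
    frontier = deque(seed_urls)          # URL frontier, seeded by the administrator
    visited, store = set(), {}
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "ignore")
        except Exception:
            continue                     # skip unreachable or non-HTML pages
        visited.add(url)
        store[url] = html                # "storing" step: keep content for indexing
        parser = AnchorParser(url)
        parser.feed(html)
        frontier.extend(link for link in parser.found if link not in visited)
    return store
```

A production crawler would add politeness delays, URL deduplication, and many distributed workers, in line with the non-functional requirements discussed below.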
Additional utilities of a web crawler are as follows:
- Web page testing: Web crawlers are a way to check the validity of web pages’ links and structure (a minimal link-check sketch follows this list).
- Web page monitoring: We use web crawlers to monitor content or structure updates on web pages.
- Site mirroring: Web crawlers are an effective way to mirror popular websites. Mirroring is like making a dynamic carbon copy of a website. It applies to network services available over any protocol, such as HTTP or FTP. The URLs of the mirror sites differ from the originals, but the content is similar or almost identical.
- Copyright infringement check: Web crawlers fetch content and check it for copyright infringement issues.
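As a small illustration of the web page testing utility, the sketch below (a hypothetical helper, not part of this design) takes links already extracted from a page and reports which ones respond; errors indicate broken links.

```python
from urllib.request import Request, urlopen


def check_links(urls, timeout=5):
    """Report the HTTP status of each URL; errors indicate broken links."""
    results = {}
    for url in urls:
        try:
            # A HEAD request avoids downloading the body when only the status matters.
            with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
                results[url] = response.status
        except Exception as exc:          # 4xx/5xx responses and network failures land here
            results[url] = f"error: {exc}"
    return results
```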
This chapter will design a web crawler and evaluate how it fulfills the functional and non-functional requirements.
The output of the crawling process is the data that serves as input to the subsequent processing phases: data cleaning, indexing, computing page relevance using algorithms like PageRank, and analytics. For some of these subsequent stages, see our chapter on distributed search.
Requirements and goals
Let’s highlight the functional and non-functional requirements for a web crawler.
Functional
Below are the functionalities the system should provide:
- Crawling: The system should scour the World Wide Web, starting from a queue of seed URLs provided initially by the system administrator.
Food for thought!
From where do we get these seed URLs?
- Storing: The system should be able to extract and store the content of a URL in a blob store, making that URL, along with its content, processable by the search engine for indexing and ranking purposes.
- Scheduling: Since crawling is a repeated process, the system should schedule regular recrawls to keep its blob store’s records up to date, as sketched below.
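A minimal way to realize the scheduling requirement is a priority queue keyed by the time each URL is next due for a recrawl. The class below is an illustrative sketch; the daily interval and all names are assumptions, and a real system might instead derive each page’s recrawl frequency from how often it changes.

```python
import heapq
import time


class RecrawlScheduler:
    """Keeps URLs in a min-heap ordered by the time they are next due to be crawled."""

    def __init__(self, recrawl_interval_sec=24 * 3600):   # assumed daily recrawl interval
        self.interval = recrawl_interval_sec
        self.heap = []                                     # entries are (due_time, url)

    def add(self, url, due_time=None):
        """Register a URL, due immediately unless a later time is given."""
        heapq.heappush(self.heap, (due_time if due_time is not None else time.time(), url))

    def pop_due(self, now=None):
        """Return all URLs whose recrawl time has passed, and reschedule each of them."""
        now = time.time() if now is None else now
        due = []
        while self.heap and self.heap[0][0] <= now:
            _, url = heapq.heappop(self.heap)
            due.append(url)
            heapq.heappush(self.heap, (now + self.interval, url))   # next scheduled pass
        return due
```

A crawling worker would periodically call `pop_due()` and push the returned URLs onto the crawl frontier.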
Non-functional
- Scalability: The system should inherently be distributed and multithreaded because it has to fetch hundreds of millions of web documents.
- Extensibility: Currently, our design supports the HTTP(S) communication protocol and the storage of text files. For augmented functionality, it should also be extensible to other network communication protocols and allow adding modules that process and store various file formats.
- Consistency: Since our system involves multiple crawling workers, data consistency among all of them is required.
- Performance: The system should be smart enough to limit its crawling of a domain, either by the time spent or by the count of URLs visited on that domain; this is called self-throttling (sketched after this list). The number of URLs crawled per second and the throughput of the crawled content should be optimal.
- Improved user interface (customized scheduling): Besides the default recrawling, which is a functional requirement, the system should also support non-routine, customized crawling on the system administrator’s demand.
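The self-throttling mentioned in the performance requirement can be sketched as a per-domain budget tracker: before each fetch, a worker asks whether the domain still has URL budget, time budget, and enough delay since its last request. The limits and names below are illustrative assumptions, not values from the design.

```python
import time
from collections import defaultdict
from urllib.parse import urlparse


class DomainThrottle:
    """Self-throttling: caps the crawl of any single domain by URL count and elapsed time."""

    def __init__(self, max_urls=500, max_seconds=300, min_delay=1.0):
        self.max_urls = max_urls            # budget of URLs per domain (assumed value)
        self.max_seconds = max_seconds      # budget of time per domain (assumed value)
        self.min_delay = min_delay          # politeness gap between requests to one domain
        self.url_count = defaultdict(int)
        self.first_fetch = {}
        self.last_fetch = {}

    def allow(self, url, now=None):
        """Return True if the crawler may fetch this URL right now."""
        now = time.time() if now is None else now
        domain = urlparse(url).netloc
        self.first_fetch.setdefault(domain, now)
        if self.url_count[domain] >= self.max_urls:
            return False                    # URL budget for this domain exhausted
        if now - self.first_fetch[domain] > self.max_seconds:
            return False                    # time budget for this domain exhausted
        if now - self.last_fetch.get(domain, 0.0) < self.min_delay:
            return False                    # too soon since the previous request
        self.url_count[domain] += 1
        self.last_fetch[domain] = now
        return True
```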
Estimations
We need to estimate various resource requirements for our design.
Assumptions: Following are the assumptions for our requirements’ estimations:
- There are a total of 5 billion web pages.
- The text content per web page is 2070 KB. A study suggests that the average size of a web page’s content is 2070 KB (2.07 MB), based on 892 processed websites.
- The metadata for one web page is 500 Bytes. It consists of the web page title and a description of the web page showing its purpose.
Storage requirements
The collective storage required to store the textual content of 5 billion web pages is: ...
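A back-of-the-envelope sketch of this calculation, using the assumptions above (the course’s exact presentation may differ):

```latex
\begin{aligned}
\text{Text content} &\approx 5 \times 10^{9}\ \text{pages} \times 2.07\ \text{MB/page} = 10.35\ \text{PB}\\
\text{Metadata} &\approx 5 \times 10^{9}\ \text{pages} \times 500\ \text{B/page} = 2.5\ \text{TB}
\end{aligned}
```

The metadata is negligible next to the text content, so a single full crawl needs on the order of 10 PB of blob storage.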