System Design: Web Crawler
Learn to design a highly scalable web crawler by quantifying resource needs for massive data volumes. Architect a distributed design using key components such as the HTML fetcher and scheduler, and implement robust mechanisms to handle challenges like crawler traps and rate limiting, ensuring high throughput and data consistency.
Introduction
A web crawler is a bot that systematically browses the web to collect data. Crawlers fetch web pages, parse their content, and extract URLs for further crawling. This process is the foundation of search engines. The crawler’s output feeds into subsequent stages:
Data cleaning
Indexing
Relevance scoring (e.g., PageRank)
URL frontier management
Analytics
This lesson focuses on the crawler’s system design, excluding downstream stages like indexing or ranking. For those, refer to the chapter on distributed search.
Benefits of a web crawler
Beyond data collection, web crawlers provide:
Web page testing: Validating links and HTML structures.
Web page monitoring: Tracking content or structure updates.
Site mirroring: Creating mirrors of popular websites. Mirroring is like making a dynamic carbon copy of a website, served over a network protocol such as HTTP or FTP. The mirror’s URLs differ from the original site’s, but its content is similar or almost identical.
Copyright infringement checks: Detecting unauthorized content usage.
Challenges of a web crawler system design
Designing a crawler involves several challenges:
Crawler traps: Infinite loops caused by dynamic links or calendar pages.
Duplicate content: Repeatedly crawling the same pages wastes resources.
Rate limiting: Fetching too many pages from a single domain overloads its servers, so we need per-domain throttling to manage this.
DNS lookup latency: Frequent DNS lookups slow down the process.
Scalability: The system must handle millions of seed URLs and distribute the load across multiple servers.
A web crawler is a common system design interview topic to assess how candidates reason about components like the HTML fetcher, extractor, and scheduler. Interviewers often ask questions such as:
How would you design a scalable crawler using Redis for caching and AWS for infrastructure?
How would you handle request timeouts and website rate limits?
What optimization strategies would you use for parsers and fetchers at a FAANG scale?
How do metrics like response time and cache hit rate help evaluate performance?
Let’s begin by defining the requirements.
Requirements
We will highlight the functional and non-functional requirements.
Functional requirements
The system must perform the following:
Crawling: Scour the web starting from a queue of seed URLs.
Where do we get these seed URLs from?
Storing: Extract and store content in a blob store for indexing and ranking.
Scheduling: Regularly schedule crawling to update records.
Non-functional requirements
Scalability: The system must be distributed and multithreaded to fetch billions of documents.
Extensibility: Support new protocols (beyond HTTP) and file formats via modular extensions.
Consistency: Ensure data consistency across all crawling workers.
Performance: Use self-throttling to limit crawling per domain (by time or count) to avoid overloading hosts and optimize throughput.
Improved UI: Support customized, on-demand crawling beyond routine schedules.
With requirements established, we can estimate the scale.
Resource estimation
We will estimate storage, time, and server requirements.
Assumptions
We assume the following:
Total web pages: 5 billion
Text content per page: 2070 KB (a study of 892 processed websites suggests the average size of a webpage’s content is 2070 KB, i.e., 2.07 MB)
Metadata per page: 500 bytes (the web page title and a description of the page’s purpose)
Storage estimation
The total storage required for 5 billion pages is:
5 × 10⁹ × (2070 KB + 500 bytes) ≈ 5 × 10⁹ × 2.07 MB ≈ 10.35 PB
Traversal time
Assuming an average HTTP traversal time of 60 ms per page, the total traversal time is 5 × 10⁹ × 60 ms = 3 × 10⁸ seconds, or roughly 9.5 years. A single instance would therefore take 9.5 years; to complete the task in one day, we need a multi-worker architecture.
Server estimation
Assuming one worker per server, we calculate the number of servers needed to finish in one day:
Since one server takes about 9.5 years (roughly 3,468 days), we need 3,468 servers to complete the task in a single day.
Bandwidth estimation
Processing 10.35 PB of data per day requires the following total bandwidth:
10.35 PB × 8 bits per byte / 86,400 seconds ≈ 958.33 Gb/sec
Distributing this load among 3,468 servers gives each server roughly 276 Mb/sec.
Adjust the assumptions below to see how the estimates change:
Estimates Calculator for the Web Crawler
| Parameter | Value | Unit |
| --- | --- | --- |
| Number of webpages | 5 | Billion |
| Text content per webpage | 2070 | KB |
| Metadata per webpage | 500 | Bytes |
| Total storage | 10.35 | PB |
| Total traversal time on one server | 9.5 | Years |
| Servers required to perform traversal in one day | 3,468 | Servers |
| Bandwidth estimate | 958.33 | Gb/sec |
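The estimates above can be reproduced with a short script. This is a minimal sketch: the 60 ms average fetch time per page is an assumption chosen to match the lesson’s 9.5-year figure, not a measured value.

```python
# Back-of-the-envelope estimates for the crawler. The 60 ms fetch time
# is an assumed average; the other inputs come from the lesson.
PAGES = 5_000_000_000
CONTENT_BYTES = 2070 * 1_000      # 2070 KB of text content per page
METADATA_BYTES = 500              # title + description per page
FETCH_SECONDS = 0.06              # assumed 60 ms HTTP traversal per page

storage_pb = PAGES * (CONTENT_BYTES + METADATA_BYTES) / 1e15
traversal_years = PAGES * FETCH_SECONDS / (86_400 * 365)
servers = round(traversal_years * 365)   # one worker per server, 1-day deadline
bandwidth_gbps = storage_pb * 1e15 * 8 / 86_400 / 1e9

print(f"Total storage:       {storage_pb:.2f} PB")
print(f"Single-server time:  {traversal_years:.1f} years")
print(f"Servers for one day: {servers}")
print(f"Ingest bandwidth:    {bandwidth_gbps:.2f} Gb/s")
```

Rounding the traversal time to 9.5 years before multiplying by 365 gives the lesson’s 3,468-server figure; the unrounded arithmetic yields 3,472.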
These estimates confirm the need for high-scale tools. We will use the following building blocks:
Building blocks
The main components of our design are:
Scheduler: Schedules crawling events for URLs.
DNS: Resolves IP addresses.
Cache: Stores fetched documents for quick access.
Blob store: Stores crawled content.
We also include these specific components:
HTML fetcher: Connects to web hosts to download content.
Service host: Manages worker instances.
Extractor: Parses URLs and documents from web pages.
Duplicate eliminator: Performs deduplication testing on URLs and documents.
We arrange these components to meet our requirements.
Design
This section details the workflow and component interactions.
Components
Key building blocks include:
Scheduler: This is one of the key building blocks that schedules URLs for crawling. It’s composed of the following two units:
Priority queue (URL frontier): The queue hosts URLs that are ready for crawling, based on two properties associated with each entry:
Priority: As a requirement, we assign variable priorities to URLs depending on their content. This attribute defines a URL’s precedence while in the URL frontier.
Update frequency: For recrawling purposes, we define a recrawl frequency for each URL. This attribute determines how often a URL is re-placed in the URL frontier.
Relational database: It stores all the URLs along with the two associated parameters mentioned above. The database gets populated by new requests from the following two input streams:
The user’s added URLs, which include seed and runtime added URLs.
The crawler’s extracted URLs.
Can we estimate the size of the priority queue? What are the pros and cons of a centralized versus a distributed priority queue?
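A centralized frontier can be sketched with a binary heap, where each entry carries the two attributes described above. This is a minimal, in-memory illustration; the class and parameter names are ours, not part of the lesson’s API. Lower numbers mean higher priority.

```python
import heapq
import itertools

class URLFrontier:
    """Minimal in-memory URL frontier: a priority queue whose entries
    carry the two per-URL attributes the scheduler tracks."""
    def __init__(self):
        self._heap = []
        self._tiebreak = itertools.count()  # FIFO order among equal priorities

    def enqueue(self, url, priority=10, recrawl_every_s=86_400):
        heapq.heappush(self._heap,
                       (priority, next(self._tiebreak), url, recrawl_every_s))

    def dequeue(self):
        priority, _, url, recrawl_every_s = heapq.heappop(self._heap)
        return url, priority, recrawl_every_s

frontier = URLFrontier()
frontier.enqueue("http://static.example.com/about", priority=5)
frontier.enqueue("http://news.example.com/", priority=1, recrawl_every_s=3_600)
url, _, _ = frontier.dequeue()
print(url)  # the priority-1 news URL is served first
```

A distributed frontier would shard this structure across scheduler instances, trading the single point of failure for coordination overhead.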
DNS resolver: Maps hostnames to IP addresses. To reduce latency, we use a custom DNS resolver that caches frequently used IPs.
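The caching behavior can be sketched as follows. The lookup function is injectable, so a real resolver (e.g., `socket.gethostbyname`) could be plugged in; here a dictionary stands in for real DNS so the sketch runs offline, and all names are illustrative.

```python
class CachingDNSResolver:
    """Caches hostname-to-IP mappings so repeated lookups for the same
    host skip the slow authoritative resolution step."""
    def __init__(self, lookup):
        self._lookup = lookup   # e.g., socket.gethostbyname in production
        self._cache = {}
        self.misses = 0

    def resolve(self, hostname):
        if hostname not in self._cache:
            self.misses += 1                           # cache miss: real lookup
            self._cache[hostname] = self._lookup(hostname)
        return self._cache[hostname]

# A dict stands in for real DNS in this sketch.
fake_dns = {"example.com": "93.184.216.34"}
resolver = CachingDNSResolver(fake_dns.__getitem__)
resolver.resolve("example.com")
resolver.resolve("example.com")
print(resolver.misses)  # 1: the second call was served from the cache
```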
HTML fetcher: Initiates communication with the host server to download content. While primarily focused on HTTP, it is extendable to other protocols.
How does the crawler handle URLs with variable priorities?
Service host: The “brain” of the crawler, managing worker instances. It performs three main tasks:
Manages the multi-worker architecture. Workers request the next URL from the URL frontier.
Resolves IP addresses via the DNS resolver.
Triggers the HTML fetcher using the resolved IP.
Extractor: Parses the web page to extract URLs and content. It sends URLs to the scheduler and content to the Document Input Stream (DIS) (e.g., Redis) for processing. Once verified as unique, content is stored in the blob store.
Duplicate eliminator: Prevents processing the same content twice. It calculates the checksum of each extracted URL and compares it against the URL checksum data store. If a match is found, the URL is discarded; otherwise, it is added to the store.
The duplicate eliminator repeats this process for document content, storing checksums in the document checksum data store.
The proposed design for the duplicate eliminator can be made robust against these two issues:
URL redirection: A redirected URL can pass the URL dedup test, but the second-stage document dedup still keeps duplicated content out of the blob store.
Content modification: Changing even one byte of a document produces a checksum different from the original’s, so the modified document is treated as new content.
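The two-stage checksum test can be sketched as follows. In-memory sets stand in for the URL and document checksum data stores, and SHA-256 is one reasonable checksum choice; the class and method names are illustrative.

```python
import hashlib

class DuplicateEliminator:
    """Two-stage dedup sketch: URL checksums first, document checksums second."""
    def __init__(self):
        self._url_checksums = set()   # stands in for the URL checksum store
        self._doc_checksums = set()   # stands in for the document checksum store

    @staticmethod
    def _checksum(data: str) -> str:
        return hashlib.sha256(data.encode()).hexdigest()

    def is_new_url(self, url: str) -> bool:
        c = self._checksum(url)
        if c in self._url_checksums:
            return False              # match found: discard the URL
        self._url_checksums.add(c)
        return True

    def is_new_document(self, content: str) -> bool:
        c = self._checksum(content)
        if c in self._doc_checksums:
            return False
        self._doc_checksums.add(c)
        return True

dedup = DuplicateEliminator()
print(dedup.is_new_url("http://example.com/a"))   # True: first sighting
print(dedup.is_new_url("http://example.com/a"))   # False: duplicate, discarded
# Changing a single byte yields a different checksum:
print(dedup.is_new_document("hello"), dedup.is_new_document("hello!"))  # True True
```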
Blob store: Because a web crawler is a core component of a search engine, storing and indexing fetched content and related metadata is essential. The system requires distributed storage, such as a blob store, to store large volumes of unstructured data.
The following illustration shows the overall design:
Workflow
Assignment to a worker: The crawler (service host) initiates the process by loading a URL from the URL frontier’s priority queue and assigns it to the available worker.
DNS resolution: The worker sends the incoming URL for DNS resolution. Before resolving the URL, the DNS resolver checks the cache and returns the requested IP address if it’s found. Otherwise, it determines the IP address, sends it back to the worker instance of the crawler, and stores the result in the cache.
Can we use DFS instead of BFS?
Communication initiation by the HTML fetcher: The worker forwards the URL and the associated IP address to the HTML fetcher, which initiates the communication between the crawler and the host server.
Content extraction: Once the worker establishes the communication, it extracts the URLs and the HTML document from the web page and places the document in a cache for other components to process it.
Dedup testing: The worker sends the extracted URLs and the document for dedup testing to the duplicate eliminator. The duplicate eliminator calculates and compares the checksum of both the URL and the document with the checksum values that have already been stored. The duplicate eliminator discards the incoming request in case of a match. If there’s no match, it places the newly calculated checksum values in the respective data stores and gives the go-ahead to the extractor to store the content.
Content storing: The extractor sends the newly discovered URLs to the scheduler, which stores them in the database and sets their priority and recrawl frequency values. The extractor also writes the required portions of the newly discovered document, currently in the DIS, into the blob store.
Recrawling: Once a cycle is complete, the crawler goes back to the first point and repeats the same process until the URL frontier queue is empty. The URLs stored in the scheduler’s database have priority and periodicity assigned to them. Enqueuing new URLs into the URL frontier depends on these two factors.
Note: Given the microservices architecture, the design can utilize client-side load balancing.
The slideshow below details the workflow:
Let’s test your understanding with the help of an AI assessment:
Recrawling is the process of revisiting and updating indexed pages. Keeping this in mind, try to answer the following questions:
- Why do we need to recrawl?
- How do we decide when to recrawl?
We will now refine the design to enhance functionality, performance, and security.
Design improvements
We address specific shortcomings in the initial design:
Shortcoming: The design currently supports only HTTP and text extraction.
Adjustment:
HTML fetcher: Add modules for other protocols (e.g., FTP). The crawler invokes the correct module based on the URL scheme. The subsequent steps will remain the same.
Extractor: Add modules to process non-text media (images, videos) from the Document Input Stream (DIS). These are stored in the blob store alongside text.
Shortcoming: The design lacks details on distributing work among multiple workers.
Adjustment: Workers dequeue URLs as they become available. We can distribute tasks using:
Domain-level assignment: Assign an entire domain to a specific worker (by hashing the hostname). This prevents redundant crawling and supports reverse URL indexing.
Range division: Assign ranges of URLs to workers.
Per-URL crawling: Workers take individual URLs. This requires coordination to avoid collisions.
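Domain-level assignment can be sketched as a hash of the hostname. This simple modulo scheme illustrates the stable mapping; the lesson’s consistent hashing additionally minimizes remapping when workers join or leave.

```python
import hashlib

def worker_for_domain(hostname: str, num_workers: int) -> int:
    """Hash the hostname so every URL from a domain maps to the same
    worker (illustrative modulo scheme, not consistent hashing)."""
    digest = hashlib.md5(hostname.encode()).hexdigest()
    return int(digest, 16) % num_workers

# All URLs under one domain land on one worker, crawl after crawl:
print(worker_for_domain("news.example.com", 8))
print(worker_for_domain("news.example.com", 8))  # same worker every time
```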
Crawler traps
A crawler trap is a URL structure that causes indefinite crawling, exhausting resources.
Classification
Traps often result from poor website structure:
Query parameters: Useless variations of a page (e.g., http://www.abc.com?query).
Internal links: Infinite redirection loops within a domain.
Calendar pages: Infinite combinations of dates.
Dynamic content: Infinite pages generated from queries.
Cyclic directories: Loops like http://www.abc.com/first/second/first/second/....
Traps can be accidental or intentional (malicious). Either way, they waste the crawler’s storage and computational resources.
Identification
We identify traps by analyzing:
URL scheme: Detecting patterns like cyclic directories.
Page count: Flagging domains with an implausible number of pages.
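A URL-scheme check can be sketched as a simple heuristic: flag URLs whose path repeats a segment too often (cyclic directories) or nests implausibly deep. The function name and thresholds below are illustrative choices, not part of the lesson.

```python
from urllib.parse import urlparse

def looks_like_cyclic_trap(url: str, max_repeats: int = 2,
                           max_depth: int = 10) -> bool:
    """Heuristic trap detector: repeated path segments or excessive
    depth suggest a cyclic-directory or generated-page trap."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    if len(segments) > max_depth:
        return True   # implausibly deep path
    return any(segments.count(s) > max_repeats for s in set(segments))

print(looks_like_cyclic_trap(
    "http://www.abc.com/first/second/first/second/first/second/"))  # True
print(looks_like_cyclic_trap("http://www.abc.com/first/second/"))   # False
```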
Solution
To avoid traps and optimize resources, we implement the following:
Application logic: Limit crawling on a domain based on page count or depth. The crawler should identify traps and mark them as “no-go” areas.
Robots exclusion protocol: Fetch and adhere to the robots.txt file. This file specifies allowed pages and revisit frequency, preventing unnecessary crawling.
Note: robots.txt does not prevent malicious traps. Other mechanisms must handle those.
Politeness: Adjust crawl speed based on the domain’s time to first byte (TTFB). Slower servers receive slower crawls to avoid timeouts and overload.
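Python’s standard library ships a robots.txt parser that covers the exclusion and politeness rules above. The sketch below parses an example robots.txt inline rather than fetching a live one (in production, `RobotFileParser.set_url` plus `read()` would fetch the real file); the `MyCrawler` agent name is a placeholder.

```python
from urllib import robotparser

# An example robots.txt, parsed inline so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

rp = robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.com/public/page.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.com/private/page.html"))  # False
print(rp.crawl_delay("MyCrawler"))  # 5: wait at least 5 s between fetches
```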
These modifications ensure the crawler is robust and efficient. Next, we validate the architecture against our requirements.
Requirements compliance
We evaluate the design against non-functional requirements.
Scalability
The design supports horizontal scaling:
Components (schedulers, workers, fetchers, stores) can be added or removed on demand.
Consistent hashing distributes hostnames across workers, allowing seamless addition/removal of servers.
Extensibility and modularity
So far, our design focuses on a single communication protocol: HTTP. But according to our non-functional requirements, the design should facilitate the inclusion of other network communication protocols, such as FTP.
To achieve this extensibility, we only need to add additional modules for the newly required communication protocols in the HTML fetcher. The respective modules will then be responsible for making and maintaining the required communications with the host servers.
Along the same lines, we expect our design to extend its functionality to other file formats, such as images and videos, by adding the corresponding modules to the extractor.
Consistency
The system consists of several crawling workers. Data consistency among crawled content is crucial. So, to avoid data inconsistency and crawl duplication, our system computes the checksums of URLs and documents and compares them with the existing checksums of the URLs and documents in the URL and document checksum data stores, respectively.
Apart from deduplication, to preserve data consistency under failure conditions, all the servers can regularly checkpoint their states to a backup service, such as Amazon S3 or an offline disk.
Performance
The proposed web crawler’s performance depends on the following factors:
URLs crawled per second: We can improve this factor by adding new workers to the system.
Utilizing blob storage for content storing: This ensures higher throughput for the massive amount of unstructured data. It also indicates a fast retrieval of the stored content, because a single blob can support up to 500 requests per second.
Efficient implementation of the robots.txt guidelines: We can implement this with application-layer logic that gives robots.txt guidelines the highest precedence while crawling.
Self-throttling: We can add various application-level checks to ensure that our web crawler doesn’t hamper the performance of the website host servers by exhausting their resources.
Scheduling
As was established previously, we may need to recrawl URLs on various frequencies. These frequencies are determined by the application of the URL. We can determine the frequency of recrawl in two different ways:
We can assign a default or a specific recrawl frequency to each URL. This assignment depends on the application of the URL, which defines its priority: standard-priority URLs get the default frequency, while higher-priority URLs get a higher recrawl frequency. Based on each URL’s associated recrawl frequency, we decide when to enqueue it into the priority queue from the scheduler’s database. The priority then defines the URL’s place in the queue.
The second method is to keep separate queues for the various priority levels, serve URLs from the high-priority queues first, and subsequently move to the lower-priority queues.
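The first method can be sketched as a min-heap keyed on next-crawl time: after each crawl, the URL is re-enqueued at now plus its recrawl interval. The function names and intervals below are illustrative.

```python
import heapq

# Recrawl scheduling sketch: a min-heap of (next_crawl_ts, url, interval_s).
schedule = []

def enqueue(url, interval_s, now):
    heapq.heappush(schedule, (now + interval_s, url, interval_s))

def crawl_due(now):
    """Pop every URL whose next-crawl time has arrived and re-enqueue
    it for its following visit."""
    done = []
    while schedule and schedule[0][0] <= now:
        _, url, interval_s = heapq.heappop(schedule)
        done.append(url)
        enqueue(url, interval_s, now)   # schedule the next visit
    return done

enqueue("http://news.example.com", interval_s=60, now=0)      # frequent recrawl
enqueue("http://static.example.com", interval_s=3_600, now=0) # infrequent recrawl
due = crawl_due(now=120)
print(due)  # only the frequently recrawled news page is due
```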
The following table summarizes the techniques for achieving non-functional requirements in a web crawler system:
Non-Functional Requirements Compliance
| Requirement | Techniques |
| --- | --- |
| Scalability | Add or remove components (schedulers, workers, fetchers, stores) on demand; distribute hostnames across workers with consistent hashing |
| Extensibility and Modularity | Add protocol modules (e.g., FTP) to the HTML fetcher; add media modules (images, videos) to the extractor |
| Consistency | Checksum-based dedup of URLs and documents; regular checkpointing of server state to a backup service |
| Performance | Add workers to raise URLs crawled per second; blob storage for high-throughput content; robots.txt compliance; self-throttling |
| Scheduling | Per-URL priority and recrawl frequency; a priority queue (or separate per-priority queues) for enqueuing |
Conclusion
The web crawler system entails a multi-worker design that uses a microservices architecture. Besides achieving the basic crawling functionality, this lesson examined the potential shortcomings and challenges of the initial design and rectified them with appropriate design modifications. The noteworthy features of our design are as follows:
Identification and design modification for crawler traps.
Extensibility of HTML fetching and content extraction modules.