What Is Middleware?

Learn about Scrapy middleware and explore how to attach them to requests and responses.

Now, we will explore Scrapy’s middleware, a crucial framework component. Middleware hooks into the handling of Scrapy’s request and response objects, letting us attach custom processing as they pass through the engine.

[Image: Middleware types]

Downloader middleware

Downloader middleware allows us to manipulate requests and responses, add custom headers, handle proxies, or modify how Scrapy interacts with websites. To enable a downloader middleware component, we define it in the spider settings, just as we did with pipelines. To do that, we add this code inside the spider class:

custom_settings = {
    "DOWNLOADER_MIDDLEWARES": {
        "ScrapyProject.middlewares.CustomDownloaderMiddleware": 543,
    },
}

Much like pipelines, middleware components run in an order determined by their assigned numbers: lower values sit closer to the engine and higher values closer to the downloader, so requests pass through them in ascending order and responses in descending order.
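
As a reference point, here is a minimal sketch of what the CustomDownloaderMiddleware registered above might look like in ScrapyProject/middlewares.py (the header tweak and logging are illustrative assumptions, not required behavior):

class CustomDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for each request on its way to the downloader.
        # Illustrative example: add a header if it is not already set.
        request.headers.setdefault("Accept-Language", "en")
        return None  # returning None lets the request continue through the chain

    def process_response(self, request, response, spider):
        # Called for each response on its way back to the engine.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response  # must return a Response (or a new Request)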

Built-in downloader middleware

Built-in downloader middleware are components that sit between Scrapy’s engine and the downloader that fetches the websites we are scraping. Several built-in downloader middleware components cover common use cases. Some of the notable ones include:

  • CookiesMiddleware

    • This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers and sends them back on subsequent requests (from that spider), just like web browsers do.

    • This middleware is enabled by default (COOKIES_ENABLED is true). Setting COOKIES_DEBUG to true in the spider settings additionally logs every cookie sent and received.

  • UserAgentMiddleware

    • It sets the User-Agent header on outgoing requests, overriding Scrapy’s default user agent. Presenting a browser-like user agent helps avoid detection as a web scraper; rotating through several user agents requires a custom middleware.

    • This middleware is enabled by default. Setting a user_agent attribute in the spider class overrides the USER_AGENT value in the spider settings, as shown in the sketch after this list.

  • RetryMiddleware

    • Manages request retries in case of network errors or HTTP error codes. Failed pages are collected during the scraping process and rescheduled once the spider has finished crawling all regular (non-failed) pages.

    • This middleware can be configured using RETRY_ENABLED and RETRY_TIMES in the spider settings.
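
For instance, a spider can override the default user agent by declaring a user_agent attribute. Here is a minimal sketch; the spider name, URL, and user-agent string are illustrative assumptions:

import scrapy

class UserAgentSpider(scrapy.Spider):
    name = "ua_demo"  # hypothetical spider name
    # Declaring user_agent here overrides USER_AGENT in the spider settings.
    user_agent = "Mozilla/5.0 (compatible; MyScraper/1.0)"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # The request above was sent with the overridden User-Agent header.
        self.logger.info("Fetched %s", response.url)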

Let’s see an example of utilizing some of these middleware.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper

Scraping quotes by utilizing built-in middleware
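
A minimal sketch of such a spider follows, assuming the quotes.toscrape.com demo site and its login form; the spider name, credentials, and selectors are illustrative:

import scrapy
from scrapy.http import FormRequest

class QuotesLoginSpider(scrapy.Spider):
    name = "quotes_login"  # hypothetical spider name

    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            # Built-in middleware listed at their default orders
            "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        },
        "COOKIES_ENABLED": True,  # let CookiesMiddleware track session cookies
        "COOKIES_DEBUG": True,    # log every cookie sent and received
        "RETRY_TIMES": 3,         # retry failed requests up to 3 times
    }

    def start_requests(self):
        # Start at the login page; CookiesMiddleware carries the session cookie.
        yield scrapy.Request("https://quotes.toscrape.com/login", callback=self.login)

    def login(self, response):
        # The demo site accepts any username/password pair.
        yield FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.parse_quotes,
        )

    def parse_quotes(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }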

Code explanation

  • Lines 12–20: We define custom settings for the spider, including middleware and retry settings.

    • Line 13: Configures downloader middleware such as CookiesMiddleware and RetryMiddleware.

    • Lines 17–18: Enables cookie handling (COOKIES_ENABLED) and cookie debugging (COOKIES_DEBUG).

    • Line 19: Sets the maximum number of retries to 3 (RETRY_TIMES).

  • Lines 22–30: We send a start request to the login page. ...