What Is Middleware?
Learn about Scrapy middleware and explore how to attach it to requests and responses.
Now, we will explore Scrapy's middleware, a crucial framework component. Middleware hooks into Scrapy's request/response processing, letting us modify and control request and response objects by attaching custom processing logic to them.
Downloader middleware
Downloader middleware allows us to manipulate requests and responses, add custom headers, handle proxies, or modify how Scrapy interacts with websites. To enable a downloader middleware component, we define it in the spider settings, the same way we did with Pipelines. To do that, we add this code inside custom_settings in the spider class:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        "ScrapyProject.middlewares.CustomDownloaderMiddleware": 543,
    }
}
Much like Pipelines, middleware components run in a specific order determined by the numbers assigned to them: for downloader middleware, lower numbers sit closer to the engine and higher numbers closer to the downloader.
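The settings above reference a CustomDownloaderMiddleware class. As a rough sketch of what such a class might look like (the class name, header, and logging below are illustrative assumptions, not part of Scrapy itself):

# middlewares.py -- a minimal, hypothetical downloader middleware
class CustomDownloaderMiddleware:

    def process_request(self, request, spider):
        # Runs for every outgoing request; e.g., attach a custom header.
        request.headers["X-Example"] = "demo"  # illustrative header
        return None  # None means: continue processing the request normally

    def process_response(self, request, response, spider):
        # Runs for every response on its way back to the spider.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response  # must return a Response (or a new Request)

Returning None from process_request lets the request continue through the remaining middleware, while returning a Response or Request instead short-circuits the chain.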
Built-in downloader middleware
Built-in downloader middleware are components that sit between Scrapy's engine and its downloader, processing every request we send and every response we receive. There are several built-in downloader middleware that cover common use cases. Some of the notable built-in middleware include:
CookiesMiddleware
This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers and sends them back on subsequent requests (from that spider), just like web browsers do.
To enable this middleware, we set COOKIES_ENABLED to True in the spider settings (it is enabled by default); setting COOKIES_DEBUG to True additionally logs every cookie sent and received.
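For example, cookie debugging could be switched on per spider like this (a minimal sketch; the spider name is an illustrative assumption):

import scrapy


class SessionSpider(scrapy.Spider):
    name = "session_demo"  # hypothetical spider name
    custom_settings = {
        "COOKIES_ENABLED": True,  # on by default; shown explicitly here
        "COOKIES_DEBUG": True,    # log every Cookie/Set-Cookie header
    }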
UserAgentMiddleware
It sets the User-Agent header on outgoing requests, which we can customize to use a specific user agent and avoid obvious detection as a web scraper. It overrides the default user agent; rotating through many user agents requires a custom middleware built on the same idea.
This middleware is enabled by default and is customized by setting self.user_agent in the spider class, which will override the USER_AGENT value in the spider settings.
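Here is a minimal sketch of a spider overriding the user agent this way (the spider name and agent string are illustrative assumptions):

import scrapy


class BrowserLikeSpider(scrapy.Spider):
    name = "browser_like"  # hypothetical spider name
    # Setting this attribute overrides the USER_AGENT value from the settings
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Demo/1.0"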
RetryMiddleware
Manages request retries in case of network errors or HTTP error codes. Failed pages are collected during the scraping process and rescheduled once the spider has finished crawling all regular (non-failed) pages.
This middleware can be configured using RETRY_ENABLED and RETRY_TIMES in the spider settings.
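For instance, retries could be configured per spider like this (a sketch; the spider name and the values are illustrative assumptions):

import scrapy


class ResilientSpider(scrapy.Spider):
    name = "resilient_demo"  # hypothetical spider name
    custom_settings = {
        "RETRY_ENABLED": True,  # on by default
        "RETRY_TIMES": 3,       # up to 3 retries per failed request
        "RETRY_HTTP_CODES": [500, 502, 503, 504, 408, 429],  # statuses worth retrying
    }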
Let’s see an example of utilizing some of these middleware.
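The original lesson shows this example in a runnable project widget that is not reproduced here. Below is a minimal reconstruction consistent with the explanation that follows, laid out so the line references still apply; the site, form fields, and callback names are illustrative assumptions:

import scrapy
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = "login_spider"
    allowed_domains = ["quotes.toscrape.com"]
    # Hypothetical login page used purely for illustration
    login_url = "https://quotes.toscrape.com/login"

    # Custom settings: middleware, cookie, and retry configuration
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        },
        "COOKIES_ENABLED": True,
        "COOKIES_DEBUG": True,
        "RETRY_TIMES": 3,
    }

    def start_requests(self):
        # Begin by requesting the login page
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Submit the login form; CookiesMiddleware stores the session
        # cookie and attaches it to every later request automatically.
        yield FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # From here on, requests carry the logged-in session cookie
        self.logger.info("Logged in, landed on %s", response.url)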
Code explanation
Lines 12–20: We define custom settings for the spider, including middleware and retry settings.
Line 13: Configures downloader middleware such as CookiesMiddleware and RetryMiddleware.
Lines 17–18: Enable cookies handling (COOKIES_ENABLED) and debugging (COOKIES_DEBUG).
Line 19: Sets the maximum number of retries to 3 (RETRY_TIMES).
Lines 22–30: We send a start request to the login page. ...