What Is Middleware?
Learn about Scrapy middleware and explore how to attach it to requests and responses.
Now, we will explore Scrapy's middleware, a crucial framework component. Middleware hooks into Scrapy's request/response processing, letting us modify and control request and response objects by attaching custom processing logic to them.
Downloader middleware
Downloader middleware allows us to manipulate requests and responses, add custom headers, handle proxies, or modify how Scrapy interacts with websites. To enable a downloader middleware component, we define it in the spider settings, the same way we did with Pipelines. To do that, we add this code inside custom_settings in the spider class:

custom_settings = {
    'DOWNLOADER_MIDDLEWARES': {
        "ScrapyProject.middlewares.CustomDownloaderMiddleware": 543,
    }
}
Much like Pipelines, middleware components run in a specific order determined by the numbers assigned to them: for downloader middleware, lower numbers sit closer to the engine and higher numbers closer to the downloader.
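The settings above reference a CustomDownloaderMiddleware class. As a rough sketch of what such a class might look like (the class name, header, and logging below are illustrative assumptions, not part of Scrapy itself):

# middlewares.py -- a minimal, hypothetical downloader middleware
class CustomDownloaderMiddleware:

    def process_request(self, request, spider):
        # Runs for every outgoing request; e.g., attach a custom header.
        request.headers["X-Example"] = "demo"  # illustrative header
        return None  # None means: continue processing the request normally

    def process_response(self, request, response, spider):
        # Runs for every response on its way back to the spider.
        spider.logger.debug("Got %s for %s", response.status, request.url)
        return response  # must return a Response (or a new Request)

Returning None from process_request lets the request continue through the remaining middleware, while returning a Response or Request instead short-circuits the chain.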
Built-in downloader middleware
Built-in downloader middleware are components that sit between Scrapy's engine and its downloader, processing every request we send and every response we receive. There are several built-in downloader middleware that cover common use cases. Some of the notable built-in middleware include:
CookiesMiddleware
This middleware enables working with sites that require cookies, such as those that use sessions. It keeps track of cookies sent by web servers and sends them back on subsequent requests (from that spider), just like web browsers do.
To enable this middleware, we set COOKIES_ENABLED to True in the spider settings (it is enabled by default); setting COOKIES_DEBUG to True additionally logs every cookie sent and received.
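For example, cookie debugging could be switched on per spider like this (a minimal sketch; the spider name is an illustrative assumption):

import scrapy


class SessionSpider(scrapy.Spider):
    name = "session_demo"  # hypothetical spider name
    custom_settings = {
        "COOKIES_ENABLED": True,  # on by default; shown explicitly here
        "COOKIES_DEBUG": True,    # log every Cookie/Set-Cookie header
    }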
UserAgentMiddleware
It sets the User-Agent header on outgoing requests, which we can customize to use a specific user agent and avoid obvious detection as a web scraper. It overrides the default user agent; rotating through many user agents requires a custom middleware built on the same idea.
This middleware is enabled by default and is customized by setting self.user_agent in the spider class, which will override the USER_AGENT value in the spider settings.
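Here is a minimal sketch of a spider overriding the user agent this way (the spider name and agent string are illustrative assumptions):

import scrapy


class BrowserLikeSpider(scrapy.Spider):
    name = "browser_like"  # hypothetical spider name
    # Setting this attribute overrides the USER_AGENT value from the settings
    user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Demo/1.0"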
RetryMiddleware
Manages request retries in case of network errors or HTTP error codes. Failed pages are collected during the scraping process and rescheduled once the spider has finished crawling all regular (non-failed) pages.
This middleware can be configured using RETRY_ENABLED and RETRY_TIMES in the spider settings.
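For instance, retries could be configured per spider like this (a sketch; the spider name and the values are illustrative assumptions):

import scrapy


class ResilientSpider(scrapy.Spider):
    name = "resilient_demo"  # hypothetical spider name
    custom_settings = {
        "RETRY_ENABLED": True,  # on by default
        "RETRY_TIMES": 3,       # up to 3 retries per failed request
        "RETRY_HTTP_CODES": [500, 502, 503, 504, 408, 429],  # statuses worth retrying
    }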
Let’s see an example of utilizing some of these middleware.
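The original lesson shows this example in a runnable project widget that is not reproduced here. Below is a minimal reconstruction consistent with the explanation that follows, laid out so the line references still apply; the site, form fields, and callback names are illustrative assumptions:

import scrapy
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = "login_spider"
    allowed_domains = ["quotes.toscrape.com"]
    # Hypothetical login page used purely for illustration
    login_url = "https://quotes.toscrape.com/login"

    # Custom settings: middleware, cookie, and retry configuration
    custom_settings = {
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy.downloadermiddlewares.cookies.CookiesMiddleware": 700,
            "scrapy.downloadermiddlewares.retry.RetryMiddleware": 550,
        },
        "COOKIES_ENABLED": True,
        "COOKIES_DEBUG": True,
        "RETRY_TIMES": 3,
    }

    def start_requests(self):
        # Begin by requesting the login page
        yield scrapy.Request(self.login_url, callback=self.login)

    def login(self, response):
        # Submit the login form; CookiesMiddleware stores the session
        # cookie and attaches it to every later request automatically.
        yield FormRequest.from_response(
            response,
            formdata={"username": "user", "password": "pass"},
            callback=self.after_login,
        )

    def after_login(self, response):
        # From here on, requests carry the logged-in session cookie
        self.logger.info("Logged in, landed on %s", response.url)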
Code explanation
Lines 12–20: We define custom settings for the spider, including middleware and retry settings.
Line 13: Configures downloader middleware such as CookiesMiddleware and RetryMiddleware.
Lines 17–18: Enable cookies handling (COOKIES_ENABLED) and debugging (COOKIES_DEBUG).
Line 19: Sets the maximum number of retries to 3 (RETRY_TIMES).
Lines 22–30: We send a start request to the login page. ...