What is Scrapy?

Data is a vital asset in today’s world, and the internet is its most abundant source. To obtain this data, we build web scrapers, programs that extract data or content from websites. One of the most widely used tools for building them is Scrapy, an open-source Python framework dedicated to web scraping and crawling. It has a large community and is widely regarded as a standard choice for web scraping in Python.

Features of Scrapy

Scrapy is used for data extraction, mining, and information retrieval for research, data analysis, and training artificial intelligence models. To enable these tasks, Scrapy has a few key features, which are shown in the table below:

| Feature | Benefit |
| --- | --- |
| Crawling and spidering | Scrapy lets programmers create customizable spiders. Spiders are classes that define how to navigate a website and what data to extract. |
| Item pipelines | Scrapy lets programmers process and store scraped data using pipelines. Typical steps include data validation, cleaning, transformation, and storage in databases or files. |
| Request and response handling | Scrapy takes care of sending HTTP requests, handling responses, and managing cookies and sessions. |
| Middleware | Scrapy supports middleware for customizing how requests and responses are processed during a crawl. |
| Robustness and throttling | Scrapy supports crawling etiquette, such as respecting a website’s robots.txt rules, adding delays between requests, and limiting concurrency to avoid overloading websites. |
| XPath and CSS selectors | Scrapy supports both XPath and CSS selectors for data extraction. |
| Extensions and plugins | Scrapy can be extended with third-party plugins and integrations, making it adaptable. One popular tool used alongside Scrapy is Selenium, a browser automation framework. |
| Asynchronous processing | Scrapy processes requests asynchronously, which makes it an efficient crawler and lets it scrape many pages, or multiple websites, concurrently. |
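To make two of these features more concrete, here is a minimal sketch of an item pipeline together with settings for polite crawling. The class name TitleCleanupPipeline, the dictionary name POLITE_CRAWL_SETTINGS, the module path myproject.pipelines, and the chosen values are assumptions made for illustration; the setting keys themselves are standard Scrapy settings.

```python
class TitleCleanupPipeline:
    """Item pipeline that normalizes whitespace in each item's title."""

    def process_item(self, item, spider):
        # Scrapy calls process_item() once for every item a spider yields.
        title = item.get("title")
        if isinstance(title, str):
            # Collapse runs of whitespace and trim both ends.
            item["title"] = " ".join(title.split())
        return item  # hand the item to the next pipeline stage


# Illustrative values only; tune them for the site being crawled.
POLITE_CRAWL_SETTINGS = {
    "ROBOTSTXT_OBEY": True,               # honor the site's robots.txt
    "DOWNLOAD_DELAY": 1.0,                # seconds to wait between requests to a site
    "CONCURRENT_REQUESTS_PER_DOMAIN": 4,  # cap parallel requests per domain
    "AUTOTHROTTLE_ENABLED": True,         # adapt the delay to server latency
    # Hypothetical import path; the number (0-1000) sets the pipeline order.
    "ITEM_PIPELINES": {"myproject.pipelines.TitleCleanupPipeline": 300},
}
```

A dictionary like this can be assigned to a spider's custom_settings attribute. In a project's settings.py, the same keys would instead be written as module-level variables.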

Example

The following code shows a basic spider that extracts book titles from a website built for scraping practice.

Note: The scraped book titles are printed in the terminal along with Scrapy’s log messages, so you may need to scroll to find them.

Book title scraper

Explanation

The following explanation of the code shows how our spider works:

  • Lines 1–2: We import the Scrapy library and define the spider class.

  • Lines 4–6: We set the spider’s class attributes. These attributes are created automatically when the spider is generated. Their values can be edited, but the attribute names must stay the same because Scrapy looks them up by name.

    • Line 4: The name attribute identifies the spider, and it is the name we use to run the spider from the terminal. It must be unique within the project.

    • Line 5: The allowed_domains attribute lists the domains that the spider is allowed to crawl. This list is optional; when it is set, requests to other domains are filtered out.

    • Line 6: We use start_urls to list the URLs from which the spider starts scraping. The first pages the spider downloads are the ones listed here.

  • Line 9: The parse method is the default callback that Scrapy uses to process downloaded responses. It serves as the main function in this example.

  • Line 11: We use CSS selectors (patterns that find and select HTML elements, originally designed for styling them) to get the elements for all books and store them in a user-defined variable.

  • Lines 13–16: We iterate over the book elements selected from the page and retrieve each book’s title using CSS selectors.

Conclusion

Scrapy is a popular choice among developers for web scraping because of its flexibility, its extensibility, and its ability to handle many of the technical challenges associated with web crawling. Its asynchronous processing and robustness make it an excellent tool for building large-scale web crawlers.

Copyright ©2024 Educative, Inc. All rights reserved