Data is a vital asset in today’s world, and the place with the most abundant data is the internet. We build web crawlers and scrapers to collect this data, and Scrapy is a popular open-source Python framework for doing exactly that.
Scrapy is used for data extraction, mining, and information retrieval for research, data analysis, and training artificial intelligence models. To enable these tasks, Scrapy has a few key features, which are shown in the table below:
| Feature | Benefit |
|---------|---------|
| Crawling and spidering | Scrapy allows programmers to create customizable spiders. Spiders are classes that define how to navigate a website and what data to extract. |
| Item pipelines | Scrapy allows programmers to process and store data using pipelines. These processes include data validation, cleaning, transformation, and storage in databases or files (see the sketch after this table). |
| Request and response handling | Scrapy handles complex tasks such as sending HTTP requests, handling responses, and managing cookies and sessions. |
| Middleware | Scrapy allows the use of middleware to customize and modify requests, responses, and the crawling process. |
| Robustness and throttling | Scrapy can handle various aspects of crawling etiquette, such as respecting a website’s legal clauses, handling request delays, and managing concurrency to avoid overloading websites. |
| XPath and CSS selectors | Scrapy supports both XPath and CSS selectors for data extraction. |
| Extensions and plugins | Scrapy can be extended using third-party plugins and frameworks, making it highly adaptable. One popular framework used with Scrapy is Selenium. |
| Asynchronous processing | Scrapy supports asynchronous processing, making it an efficient web crawler and allowing concurrent scraping of multiple websites. |
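As a concrete illustration of the item pipelines feature listed above, here is a minimal sketch of a pipeline that validates and cleans a scraped `title` field. The class name, the field name, and the cleaning logic are illustrative assumptions rather than part of any particular project.

```python
from scrapy.exceptions import DropItem


class CleanTitlePipeline:
    """Illustrative pipeline: validate and normalize the 'title' field of each item."""

    def process_item(self, item, spider):
        title = item.get("title")
        if not title:
            # Discard items that have no title at all
            raise DropItem("Missing title in item")
        # Basic cleaning/transformation before the item is stored or exported
        item["title"] = title.strip()
        return item
```

In a real project, a pipeline like this would be activated by adding its import path to the `ITEM_PIPELINES` setting with a priority value, for example `ITEM_PIPELINES = {"myproject.pipelines.CleanTitlePipeline": 300}` (the module path here is a placeholder).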
The following code shows a basic spider that extracts book titles from a practice scraping website.
Note: The output is shown in the terminal; scroll through the terminal log to find the list of book titles.
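The spider’s source code is not reproduced in this text, so the snippet below is a minimal sketch consistent with the walkthrough that follows. The class name `BooksSpider`, the spider name `"books"`, and the `books.toscrape.com` practice site are illustrative assumptions, and the code is laid out so that its line numbers match the explanation below.

```python
import scrapy
class BooksSpider(scrapy.Spider):
    # Required spider attributes; keep these names, edit the values
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ["https://books.toscrape.com/"]

    # Default callback invoked for every downloaded response
    def parse(self, response):
        # Select every book element on the page
        books = response.css("article.product_pod")
        # Yield one item per book, extracting its title attribute
        for book in books:
            yield {
                "title": book.css("h3 a::attr(title)").get(),
            }
```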
The following explanation of the code shows how our spider works:
Lines 1–2: We import the Scrapy library and define the spider class.
Lines 4–6: We set the default variables for the spider. These variables are initialized when our spider is generated. Their values can be edited, but the variable names must not be changed.
Line 4: We use the `name` variable when executing the spider from the terminal (a sketch of running the spider follows this walkthrough). It defines the name of this spider, which must be unique.
Line 5: The `allowed_domains` variable contains the list of all the domains that the spider can access. This list is optional.
Line 6: We use the `start_urls` variable to define the URLs from which the spider starts scraping. When no other URLs are specified, the first pages that the spider downloads are the ones listed here.
Line 9: The `parse()` method is the default callback used by Scrapy to process the downloaded responses. It serves as our main function in this example.
Line 11: We use the `response.css()` method to select all the book elements on the page.
Lines 13–16: We loop through all the book elements scraped from the page and retrieve each book’s title using CSS selectors.
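Because the `name` attribute identifies the spider, it would normally be run from the project directory with the `scrapy crawl books` command. As a sketch of the programmatic alternative, the spider can also be launched with Scrapy’s `CrawlerProcess`; the `FEEDS` setting and the output file name below are illustrative.

```python
from scrapy.crawler import CrawlerProcess

# Export the yielded items to a JSON file (the file name is illustrative)
process = CrawlerProcess(settings={
    "FEEDS": {"books.json": {"format": "json"}},
})
process.crawl(BooksSpider)  # the spider class defined above
process.start()             # blocks until the crawl is finished
```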
Scrapy is a popular choice among developers for web scraping because of its flexibility, extensibility, and ability to handle many of the technical challenges associated with web crawling. The framework’s asynchronous processing and robustness make it an excellent tool for building large-scale, scalable web crawlers.