What is Scrapy?

Features of Scrapy

Feature	Benefit
Crawling and spidering	Scrapy allows programmers to create customizable spiders. Spiders are classes that define how to navigate a website and what data to extract.
Item pipelines	Scrapy allows programmers to process and store data using pipelines. These processes include data validation, cleaning, transformation, and storage in databases or files.
Request and response handling	Scrapy handles complex tasks like sending HTTP requests, handling responses, and managing cookies and sessions.
Middleware	Scrapy supports middleware that hooks into the crawling process to customize or modify requests and responses.
Robustness and throttling	Scrapy can handle various aspects of crawling etiquette, such as respecting a website's robots.txt rules, adding delays between requests, and limiting concurrency to avoid overloading websites.
XPath and CSS Selectors	Scrapy supports both XPath and CSS selectors for data extraction.
Extensions and plugins	Scrapy can be extended with third-party plugins and tools, which makes it adaptable. One popular tool used alongside Scrapy is Selenium.
Asynchronous processing	Scrapy processes requests asynchronously, which makes it an efficient web crawler and lets it scrape multiple pages or websites concurrently.
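
The throttling and etiquette behavior listed in the table is controlled through project settings. The following is a minimal sketch of a settings.py fragment. The setting names are Scrapy's own, but the specific values are illustrative assumptions rather than recommendations.

```python
# settings.py (fragment); all values below are illustrative assumptions

# Honor each site's robots.txt rules before requesting a page
ROBOTSTXT_OBEY = True

# Wait about one second between requests to the same site
DOWNLOAD_DELAY = 1.0

# Cap how many requests run in parallel against a single domain
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Let the AutoThrottle extension adapt the delay to server latency
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1.0
AUTOTHROTTLE_MAX_DELAY = 10.0
```

Lower concurrency and longer delays make a crawl slower but gentler on the target server, so these values are usually tuned per site.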

Explanation

The following explanation of the code shows how our spider works:

Lines 1–2: We import the Scrapy library and define the spider class.
Lines 4–6: We set the default attributes of the spider. Scrapy fills in these attributes when it generates the spider. Their values can be edited, but the attribute names must not be changed.
- Line 4: The name attribute defines the name of this spider, which must be unique. We use it to run the spider from the terminal.
- Line 5: The allowed_domains attribute contains the list of all the domains that the spider is allowed to crawl. This list is optional.
- Line 6: We use start_urls to define the URLs from which the spider starts scraping. When no other starting requests are specified, the first pages that the spider downloads are the ones listed here.
Line 9: We define parse, the default callback that Scrapy uses to process downloaded responses. It serves as the main function in this example.
Line 11: We use CSS selectors (patterns used to find and select HTML elements) to get the data of all the books and store it in a user-defined variable.
Lines 13–16: We iterate over the book data that we scraped from the website and retrieve each book's title using CSS selectors.

Conclusion

Scrapy is a popular choice among developers for web scraping because it is flexible, extensible, and handles many of the technical challenges associated with web crawling. Its asynchronous processing and robustness make it an excellent tool for building large-scale web crawlers.
