Scrapy Data Pipeline
Learn how Scrapy organizes its data pipeline and exports scraped data in a structured format.
Having familiarized ourselves with Scrapy's fundamental modules, which empower us to extract information from various websites, it's time to explore exporting our scraper's output in a structured format.
Core modules
Scrapy offers a systematic approach to turning the unstructured data we scrape into structured output that can be easily employed for various purposes. It achieves this through three core modules: Items, Item Loaders, and Item Pipelines.
The diagram below illustrates the fundamental connections between these modules:
Spider.py is the core scraping spider code. It uses Items.py with an ItemLoader to containerize the scraped data, and then ItemPipeline.py to perform final processing on the data and save it in a structured format.
Items
Items are simple containers that hold the data we want to extract from a website. They serve as a structured data representation and help us maintain consistency in our scraped results.
Items are defined using Python classes that inherit from scrapy.Item inside the Items.py file. Each attribute of the item class represents a piece of data we want to extract. By defining the fields in the item class, we specify the structure of the data we will scrape.
Here’s a basic example of defining a Scrapy item for scraping quotes from the Quotes to Scrape website:
import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
In this example, the QuoteItem class represents a quote with its corresponding text, author, and tags. Field objects are used to specify metadata for each field, and there is no restriction on the kind of metadata or the values a Field object accepts.
Once we've defined our item class, we can start using it. Within our spider's parsing methods, we can create instances of the item class, assign values to its fields, and yield the populated item.
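As a minimal sketch of this pattern, a spider along the following lines could populate and yield QuoteItem instances from its parse method. The QuotesSpider class name, the spider name, the start URL, the CSS selectors, and the scraper.items import path are illustrative assumptions based on the Quotes to Scrape site mentioned above:

import scrapy
from scraper.items import QuoteItem  # assumes QuoteItem is defined in the project's Items.py

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com"]

    def parse(self, response):
        # Each quote block on the page becomes one structured item.
        for quote in response.css("div.quote"):
            item = QuoteItem()
            item["text"] = quote.css("span.text::text").get()
            item["author"] = quote.css("small.author::text").get()
            item["tags"] = quote.css("div.tags a.tag::text").getall()
            yield item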
Inspecting the code output, we will find the data yielded in a more structured way as a ...