Scrapy Data Pipeline

Learn how Scrapy organizes scraped data into a pipeline and exports it in structured formats.

Having familiarized ourselves with Scrapy's fundamental modules, which empower us to extract information from various websites, it's time to explore exporting our scraper's output in a structured format.

Core modules

Scrapy offers a systematic approach to organizing the data we scrape into structured formats that can be easily employed for various purposes. It achieves this through three core modules:

Scrapy output modules

The diagram below illustrates the fundamental connections between these modules:

Scrapy output modules

Spider.py contains the core scraping spider code. It uses Items.py, together with ItemLoader, to containerize the scraped data, then passes the items to ItemPipeline.py, which performs final processing and saves the data in a structured format.
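As a minimal sketch of the final step in that flow, an item pipeline is just a class with a process_item() method that Scrapy calls for every item the spider yields. The class and field names below are illustrative, not part of the lesson's playground code:

```python
# A minimal item pipeline sketch. Scrapy calls process_item() for each
# yielded item; returning the item passes it on to the next pipeline stage.
class CleanQuotePipeline:
    def process_item(self, item, spider):
        # Strip surrounding whitespace from the quote text, if present.
        if item.get("text"):
            item["text"] = item["text"].strip()
        return item
```

A pipeline like this is enabled by adding it to the ITEM_PIPELINES setting in the project's settings file.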

Items

Items are simple containers that hold the data we want to extract from a website. They serve as a structured data representation and help us maintain consistency in our scraped results.

Items are defined using Python classes that inherit from scrapy.Item inside the Items.py file. Each attribute of the item class represents a piece of data we want to extract. By defining the fields in the item class, we specify the data structure we will scrape.

Here’s a basic example of defining a Scrapy item for scraping quotes from the Quotes to Scrape website:

import scrapy

class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

In this example, the QuoteItem class represents a quote with its text, author, and tags. Field objects are used to specify metadata for each field; there is no restriction on the values they accept, so we can attach any metadata we like.

Once we've defined our item class, we can start using it. Within our spider's parsing methods, we can create instances of the item class, assign values to its fields, and yield the populated item.

# Automatically created by: scrapy startproject
#
# For more information about the [deploy] section see:
# https://scrapyd.readthedocs.io/en/latest/deploy.html

[settings]
default = scraper.settings

[deploy]
#url = http://localhost:6800/
project = scraper
Scraping data from quotes website by utilizing Items

Inspecting the code output, we will find the data yielded in a more structured way as a ...