Validation with JSON Schema

Learn about JSON Schema and how to validate the scraped data.

Up to this point, we have been scraping data from various websites without verifying its accuracy. Data is not always complete and error-free; there will often be missing fields or incorrect values. Given that we are developing an automated script to scrape millions of records, it is crucial to implement a validation mechanism to ensure the quality of the scraped data.

The JSON Schema library in Python

The jsonschema library is another powerful tool available to us in Python. It is a third-party implementation of the JSON Schema specification that allows us to check whether our JSON data is structured correctly. Since we often use dictionaries to organize data in our web scraping scripts, jsonschema can come in handy to ensure our data is in the correct format before we do anything else.

It can also be used alongside other libraries such as Selenium or Beautiful Soup. Scrapy's item pipeline is a particularly good fit for this situation. With an item pipeline, we can set up a process that validates each scraped item before moving on to further processing.

JSON Schema with Python
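
As a rough, framework-agnostic sketch of that per-item check, the snippet below splits a batch of scraped dictionaries into valid and rejected items. The schema, the split_items helper, and the sample records are hypothetical and exist only for illustration; the schema keywords themselves are covered in the Syntax section. In Scrapy, the same check would typically live inside an item pipeline's process_item method.

```python
from jsonschema import Draft7Validator

# Hypothetical schema, for illustration only.
item_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
    },
    "required": ["name", "price"],
}

# Build the validator once and reuse it for every scraped item.
validator = Draft7Validator(item_schema)


def split_items(items):
    """Separate items that match the schema from those that do not.

    Returns (valid, rejected), where rejected is a list of
    (item, error_messages) pairs.
    """
    valid, rejected = [], []
    for item in items:
        # iter_errors yields every violation instead of stopping at the first.
        errors = [error.message for error in validator.iter_errors(item)]
        if errors:
            rejected.append((item, errors))
        else:
            valid.append(item)
    return valid, rejected


# Hypothetical scraped records.
scraped = [
    {"name": "Desk lamp", "price": 24.99},
    {"name": "Notebook"},   # price is missing
    {"price": "3"},         # name is missing and price is a string
]

valid, rejected = split_items(scraped)
print(f"{len(valid)} valid, {len(rejected)} rejected")
for item, errors in rejected:
    print(item, errors)
```

Collecting every error per item, rather than stopping at the first one, makes it easier to see which fields a scraper is failing to extract.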

Installation

We can install the jsonschema library in any Python environment by running the following command:

pip install jsonschema

Syntax

The jsonschema library validates data against schemas written in a JSON-based syntax that defines the structure and constraints of JSON data. For instance, if we are scraping product information, our schema might define properties like name, price, and description, along with their respective data types.

{
  "type": "object",
  "properties": {
    "name": { "type": "string" },
    "price": { "type": "number" },
    "description": { "type": "string" }
  },
  "required": ["name", "price"]
}
A sample schema
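
To see the sample schema in action, here is a minimal sketch that validates two records against it with the library's validate function. The check_product helper and the sample records are hypothetical names introduced only for this example.

```python
from jsonschema import validate
from jsonschema.exceptions import ValidationError

# The sample schema, written as a Python dictionary.
product_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "description": {"type": "string"},
    },
    "required": ["name", "price"],
}


def check_product(record):
    """Return None if the record matches the schema, else an error message."""
    try:
        validate(instance=record, schema=product_schema)
    except ValidationError as error:
        return error.message
    return None


# Hypothetical scraped records.
good_record = {"name": "Desk lamp", "price": 24.99}
bad_record = {"name": "Desk lamp", "price": "24.99"}  # price scraped as text

print(check_product(good_record))
print(check_product(bad_record))
```

The first record passes because description is not listed under required, so it may be omitted. The second fails because the price was scraped as a string rather than a number, which is a common issue when values are read directly from HTML.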

Essential components

The essential components of JSON Schema ...