Manage Data Quality in dbt
Learn how to run data test in dbt.
We'll cover the following
To achieve high data quality, testing stands as an indispensable component within the data pipeline. In this context, we will discover the testing framework in dbt, which empowers developers to verify various data metrics in an automated fashion.
Data test vs. unit test
Before writing a test, let's make sure we understand the distinction between a data test and a unit test. A data test is a runtime evaluation conducted within the production pipeline that assesses data quality from various perspectives like freshness and uniqueness. It typically runs after the table is updated with the latest data. For instance, if the source data introduces a new category
value, the data test should detect the change and send an alert if necessary. Its role extends to detecting both data issues coming from the source as well as logic errors within the data warehouse.
On the other hand, a unit test is conducted before deploying the code change to production, and it focuses on testing the transformation logic rather than the data. During a unit test, a dedicated test dataset with expected input and output is prepared instead of using unpredictable production data. The goal is to identify logic bugs, such as inaccuracies in metric formulas or the absence of a filter before the release. In the next part, we will focus on the data test in dbt.
The dbt test
The tests in dbt are select
statements that search for failing records, records that don't meet certain conditions. Out of the box, dbt provides a few built-in tests, such as unique
, not_null
, accepted_values
, etc. The dbt-utils package provides more advanced tests such as at_least_one
, not_null_proportion
, etc. We can also create our own test package and share it with others.
Here is an example of the dbt test. Tests are defined in the model configuration file.
Get hands-on with 1400+ tech skills courses.