Introduction
Learn about the decisions we make regarding the destination repository of ETL pipelines.
In the final step of the ETL pipeline, the transformed and processed data is stored in a repository for future analysis, processing, reporting, and decision-making. Several choices need to be made before the loading process begins, some of which are straightforward and determined by the business requirements, while others require careful consideration.
One of the straightforward decisions is selecting the type of repository to use. Different types of repositories serve different purposes: relational or non-relational production databases, data warehouses, and data lakes.
The choice of repository depends on how the data will be used. For example, if the data is intended for large analytical queries, it should be loaded into a data warehouse. If it’s for many simultaneous transactions, it should be loaded into a production database. If the data is structured as documents, it should be loaded into a non-relational document-based database. If the data is unstructured, it should be loaded into a data lake.
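The mapping described above can be sketched as a small helper function. This is purely illustrative: the function name, parameters, and category strings are assumptions chosen to mirror the examples in the text, not part of any real library.

```python
# Hypothetical helper mapping a data profile to a destination repository type.
# Categories follow the examples in the text; names are illustrative only.

def choose_repository(workload: str, structure: str) -> str:
    """Return a suggested destination repository for the given data profile."""
    if structure == "unstructured":
        return "data lake"               # raw files, logs, media
    if structure == "documents":
        return "document database"       # non-relational, JSON-like records
    if workload == "analytical":
        return "data warehouse"          # large, scan-heavy analytical queries
    if workload == "transactional":
        return "production database"     # many simultaneous transactions
    return "data lake"                   # safe default for unknown profiles

print(choose_repository("analytical", "structured"))    # data warehouse
print(choose_repository("transactional", "structured")) # production database
```

Real decisions are rarely this clean (a workload can be both analytical and transactional), but the branching order shows a useful heuristic: data structure usually constrains the choice before workload does.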
Besides choosing the type of repository to load the data into, we also need to decide on a few other things, such as:
Choosing a specific vendor for the repository, such as PostgreSQL, MySQL, MongoDB, or BigQuery
Hosting and deployment options, such as on-premise or cloud-based, and whether to use open-source or proprietary solutions
The data loading approach, such as loading on a predetermined schedule or on demand
Data governance and security considerations
Cost and maintenance considerations
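One lightweight way to keep these decisions explicit and reviewable is to capture them in a single configuration object. The sketch below is an assumption for illustration: the class name, field names, and values are made up, not a standard schema.

```python
# Illustrative sketch: recording destination-repository decisions as a config
# object. All field names and values are hypothetical examples.

from dataclasses import dataclass
from typing import Optional

@dataclass
class DestinationConfig:
    repository_type: str             # e.g., "data warehouse"
    vendor: str                      # e.g., "BigQuery", "PostgreSQL", "MongoDB"
    hosting: str                     # "on-premise" or "cloud"
    licensing: str                   # "open-source" or "proprietary"
    load_mode: str                   # "scheduled" or "on-demand"
    load_schedule: Optional[str]     # cron-style expression when scheduled

config = DestinationConfig(
    repository_type="data warehouse",
    vendor="BigQuery",
    hosting="cloud",
    licensing="proprietary",
    load_mode="scheduled",
    load_schedule="0 2 * * *",       # nightly load at 02:00
)
print(config.load_mode)              # scheduled
```

Keeping these choices in one place makes it easy to review them against business requirements and to change, say, the load schedule without touching pipeline code.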
This section focuses on these considerations.
The main goal is to choose a suitable repository and ensure that the data is loaded in a way that supports the intended use and meets the business requirements while minimizing the costs and complexity of maintaining the destination repository.
Key metrics
When making the above choices, several metrics must be weighed. The ideal solution strikes a balance among the following factors in the choice of a destination repository:
Data volume: It must be able to handle the volume of data being generated by the pipeline, both in terms of storage and processing power.
Data access: It must provide adequate access to the data, such as read/write capabilities, security and authentication, and query performance.
Scalability: It must be scalable to accommodate future growth in data volume and complexity.
Integration: It must be compatible with the ETL pipeline and other systems and applications that need to access the data.
Cost: It must be cost-effective and provide value for the investment.
Performance: It must provide adequate performance for the required data operations.
Maintenance: It must be maintainable, with adequate documentation and support from the vendor.
Data governance: It must provide the necessary data governance capabilities, such as data quality, privacy, and retention.
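One common way to balance these factors is a weighted scoring matrix: rate each candidate repository per metric, weight the metrics by business priority, and compare totals. The sketch below assumes made-up weights and scores purely for illustration; they are not vendor benchmarks.

```python
# A minimal weighted-scoring sketch for comparing candidate repositories
# against the key metrics. Weights and scores are illustrative assumptions.

METRIC_WEIGHTS = {
    "data_volume": 0.20,
    "data_access": 0.15,
    "scalability": 0.15,
    "integration": 0.10,
    "cost": 0.15,
    "performance": 0.10,
    "maintenance": 0.05,
    "data_governance": 0.10,
}

def score(candidate: dict) -> float:
    """Weighted sum of per-metric scores (each on a 1-5 scale)."""
    return sum(candidate[m] * w for m, w in METRIC_WEIGHTS.items())

# Hypothetical candidates with per-metric ratings (1 = poor, 5 = excellent).
candidates = {
    "warehouse_a": {"data_volume": 5, "data_access": 4, "scalability": 5,
                    "integration": 4, "cost": 2, "performance": 4,
                    "maintenance": 3, "data_governance": 4},
    "warehouse_b": {"data_volume": 3, "data_access": 4, "scalability": 3,
                    "integration": 5, "cost": 5, "performance": 3,
                    "maintenance": 4, "data_governance": 3},
}

best = max(candidates, key=lambda name: score(candidates[name]))
print(best, round(score(candidates[best]), 2))
```

The weights encode the trade-off explicitly: a cost-sensitive team would raise the `cost` weight, which can flip the ranking toward the cheaper option.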
With these considerations in mind, let’s discuss some options for choosing how to deploy and load data to the destination repository.