Introduction
Get an overview of the data sources and formats in the ETL pipeline’s initial stage.
In today's world, data is everywhere, constantly generated and stored in a wide variety of sources. To move that data effectively, we first need to learn how to extract it.
This section introduces techniques for extracting data from diverse sources, including relational and non-relational databases, cloud data warehouses, APIs, web scraping, and more. These skills will help us build ETL pipelines that can extract data in different formats and for multiple purposes.
Data sources
The method for extracting data from each source varies depending on the specific characteristics of the data, the source itself, and the purpose for which we extract the data.
Some common data sources include:
Relational databases: They store data in structured tables, which makes extraction as simple as writing a SQL query (see the first sketch after this list). Examples include MySQL, Oracle, and PostgreSQL.
Non-relational databases: They store data as documents, key-value pairs, columns, or graphs. Examples include MongoDB, Neo4j, and Apache Cassandra.
Data warehouses: They serve as central sources for business users and store structured data. Examples include Snowflake and Google BigQuery.
Data lakes: They store structured, semi-structured, and unstructured data at a large scale. Examples include Amazon S3 and Azure Data Lake Storage.
APIs: They provide quick, programmatic access to data from web-based apps and online services (see the API sketch after this list).
Web scraping tools: They extract data from the HTML content of web pages (see the scraping sketch after this list).
Streaming sources: They generate real-time data continuously and require specialized handling. Examples include social media feeds.
Cloud repositories: They store vast amounts of data and are common sources for extraction.
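To make the relational case concrete, here is a minimal sketch using Python's built-in sqlite3 module. The database file example.db and the customers table are hypothetical placeholders; with a server-based database such as MySQL or PostgreSQL, only the driver and connection details change, while the query-based extraction pattern stays the same.

```python
import sqlite3

# A minimal sketch of relational extraction, assuming a local SQLite
# database file and a hypothetical `customers` table.
connection = sqlite3.connect("example.db")  # hypothetical database file
cursor = connection.cursor()

# Structured tables make extraction a matter of writing a SQL query.
cursor.execute("SELECT id, name, email FROM customers")
rows = cursor.fetchall()

for row in rows:
    print(row)

connection.close()
```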
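For API extraction, a typical pattern is an HTTP GET request followed by parsing the JSON response. The sketch below uses the popular requests library; the endpoint URL and query parameter are hypothetical.

```python
import requests

# A minimal sketch of API extraction; the endpoint URL and the
# `since` parameter are hypothetical placeholders.
response = requests.get(
    "https://api.example.com/v1/orders",  # hypothetical endpoint
    params={"since": "2024-01-01"},
    timeout=10,
)
response.raise_for_status()  # fail fast on HTTP errors

orders = response.json()  # most web APIs return JSON payloads
print(len(orders), "records extracted")
```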
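And for web scraping, a common approach is to fetch a page's HTML and parse it with a library such as BeautifulSoup. The URL and the h2 selector below are hypothetical and depend entirely on the structure of the page being scraped.

```python
import requests
from bs4 import BeautifulSoup

# A minimal scraping sketch; the URL and the `h2` selector are
# hypothetical and depend on the target page's structure.
html = requests.get("https://example.com/articles", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

# Pull the text of every second-level heading on the page.
titles = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(titles)
```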