Data Ingestion
Learn about the many different ways data can be migrated into a centralized location.
Data ingestion is the process of migrating data from its various sources to a centralized location or storage area.
Why would we need to ingest or migrate data? A primary reason is to create one place where analytics can be done on the entirety of the available data. Without data ingestion, analytics can still be done from individual data sources. However, this siloed (isolated) approach can get overly complex and doesn’t provide a holistic way to discover insights from relevant data.
AWS services for data ingestion
AWS includes a variety of services for ingesting data into a data lake.
AWS Database Migration Service
Allows the migration of data from one database to another.
It's possible to migrate between the same database engine (e.g., from one Oracle database to another) or between different database engines (e.g., from an Oracle database to a MySQL database).
To use AWS Database Migration Service (DMS), at least one of the two databases (the source or the target) must be hosted on AWS.
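Here's a minimal sketch using boto3 (the AWS SDK for Python) of starting a DMS migration. It assumes the source endpoint, target endpoint, and replication instance have already been created in DMS; all ARNs below are hypothetical placeholders.

```python
import boto3

dms = boto3.client("dms")

# Define a migration task between two pre-configured endpoints.
response = dms.create_replication_task(
    ReplicationTaskIdentifier="oracle-to-mysql-task",
    SourceEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:SOURCE",
    TargetEndpointArn="arn:aws:dms:us-east-1:123456789012:endpoint:TARGET",
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:INSTANCE",
    MigrationType="full-load",  # or "cdc" / "full-load-and-cdc" for ongoing replication
    # Table-mapping rules: this example selects every schema and table.
    TableMappings='{"rules": [{"rule-type": "selection", "rule-id": "1", '
                  '"rule-name": "1", "object-locator": {"schema-name": "%", '
                  '"table-name": "%"}, "rule-action": "include"}]}',
)

# Kick off the migration.
dms.start_replication_task(
    ReplicationTaskArn=response["ReplicationTask"]["ReplicationTaskArn"],
    StartReplicationTaskType="start-replication",
)
```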
AWS DataSync
Allows migration of data between storage systems such as Network File System (NFS) file servers, Server Message Block (SMB) file servers, Hadoop Distributed File System (HDFS), object storage systems, and Amazon services including Amazon S3, Elastic File System (EFS), and AWS Snowcone devices.
Also supports archiving cold data (rarely accessed data) to long-term storage on AWS through services such as S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive.
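As a rough boto3 sketch, here's how a DataSync transfer might be kicked off, assuming the source location (e.g., an NFS server) and destination location (e.g., an S3 bucket) were already registered with DataSync; the location ARNs are hypothetical placeholders.

```python
import boto3

datasync = boto3.client("datasync")

# Create a task linking a registered source location to a destination location.
task = datasync.create_task(
    SourceLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-src",
    DestinationLocationArn="arn:aws:datasync:us-east-1:123456789012:location/loc-dst",
    Name="nfs-to-s3-archive",
)

# Start a transfer; DataSync handles the copying, integrity checks, and retries.
datasync.start_task_execution(TaskArn=task["TaskArn"])
```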
Amazon Kinesis
Allows the ingestion of real-time streaming data so that processing and analysis can happen as soon as the data arrives.
Streaming data (data that's continuously generated, including website clickstreams, video, audio, and application logs) can be captured and processed by Kinesis Data Streams, Kinesis Data Firehose, Kinesis Video Streams, and Managed Streaming for Apache Kafka.
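A minimal boto3 sketch of pushing a single event into a Kinesis data stream, assuming a stream named "clickstream" already exists; the stream name and event fields are made up for illustration.

```python
import json
import time
import boto3

kinesis = boto3.client("kinesis")

# One clickstream event, generated the moment the user acts.
event = {"user_id": "u-123", "page": "/checkout", "timestamp": time.time()}

kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],  # determines which shard receives the record
)
```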
Amazon Managed Streaming for Apache Kafka
Allows the use of Apache Kafka, an open-source platform for ingesting and processing streaming data.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) provides additional functionality for managing and configuring servers for Kafka-based applications.
Amazon MSK also attempts to detect and automatically recover from common failure scenarios for Kafka clusters so that related applications can continue operating.
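Because MSK runs standard Apache Kafka, any Kafka client library works unchanged. Here's a rough sketch using the third-party kafka-python package, assuming an MSK cluster is already running and reachable; the broker address and topic name are hypothetical placeholders.

```python
from kafka import KafkaProducer  # third-party kafka-python package

# Connect to a (hypothetical) MSK bootstrap broker, just as with any Kafka cluster.
producer = KafkaProducer(
    bootstrap_servers="b-1.example-cluster.kafka.us-east-1.amazonaws.com:9092",
)

# Publish one log record to a topic; MSK manages the brokers behind the scenes.
producer.send("application-logs", b'{"level": "INFO", "message": "user signed in"}')
producer.flush()
```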
AWS IoT Core
Allows the connection of Internet of Things (IoT) devices and their messages to AWS services. IoT refers to the inclusion of sensors and other technologies in physical devices that can then transmit data through a communications network (e.g., the public internet). AWS IoT Core is designed to connect billions of IoT devices and route trillions of messages to AWS services for further processing and analysis.
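A minimal boto3 sketch of publishing one device reading to an MQTT topic through AWS IoT Core's data plane; the topic and payload are hypothetical, and real devices would typically use an MQTT client with X.509 certificates rather than boto3.

```python
import json
import boto3

iot_data = boto3.client("iot-data")

# Publish a (hypothetical) temperature reading to an MQTT topic.
iot_data.publish(
    topic="sensors/warehouse-7/temperature",
    qos=1,  # MQTT quality of service: deliver at least once
    payload=json.dumps({"celsius": 21.4}),
)
```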
Amazon AppFlow
Allows the migration of data between SaaS applications and AWS services without writing code.
Supported SaaS applications include Salesforce, Marketo, SAP, Zendesk, Slack, and ServiceNow.
Supported AWS services include Amazon S3 and Amazon Redshift. Each flow run can transfer up to 100 GB of data, which allows millions of SaaS records to be transferred for further processing and analysis.
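Since AppFlow flows are configured rather than coded, the programmatic surface is small. Here's a boto3 sketch of triggering an on-demand run, assuming a flow (e.g., Salesforce to S3) was already defined in AppFlow; the flow name is a hypothetical placeholder.

```python
import boto3

appflow = boto3.client("appflow")

# Trigger a run of a pre-configured flow; no extraction code is written by hand.
execution = appflow.start_flow(flowName="salesforce-to-s3")
print(execution["executionId"])
```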
AWS Data Exchange
Allows the integration of third-party data from an AWS-hosted data marketplace into Amazon services such as S3.
Data providers include Reuters, Foursquare, Change Healthcare, Equifax, and many others. The data products span industries, including healthcare, media and entertainment, financial services, and more.
After subscribing to a data product, customers can use the subscribed data from within other AWS services and can also be alerted through Amazon CloudWatch Events when updates to the data become available.
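As a rough boto3 sketch, a subscriber might export a revision of a subscribed data set to S3 like this; the data set, revision, asset, and bucket identifiers are all hypothetical placeholders for values from an actual subscription.

```python
import boto3

dx = boto3.client("dataexchange")

# Create an export job that copies a subscribed revision's asset into S3.
job = dx.create_job(
    Type="EXPORT_ASSETS_TO_S3",
    Details={
        "ExportAssetsToS3": {
            "DataSetId": "data-set-id",
            "RevisionId": "revision-id",
            "AssetDestinations": [
                {"AssetId": "asset-id", "Bucket": "my-analytics-bucket", "Key": "dx/data.csv"}
            ],
        }
    },
)
dx.start_job(JobId=job["Id"])
```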
Real-time vs. batch ingestion
Some of the available services, such as Amazon Kinesis, are designed to ingest data in real time as soon as it arrives. This approach is effective for use cases where it’s important to process and analyze data as soon as possible (i.e., within seconds)—for example, when decisions must be made based on up-to-the-minute information.
For many other use cases, it's sufficient to migrate data in batches. The migration schedule can be configured at various intervals according to the use case (e.g., every hour or every day). The flexibility to use batch ingestion reduces energy consumption and can be more cost-efficient with no significant degradation to the user experience (as compared to real-time ingestion).
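For instance, a batch schedule can be expressed as an Amazon EventBridge rule that fires at a fixed interval and triggers an ingestion job. A minimal boto3 sketch, where the Lambda function ARN is a hypothetical placeholder:

```python
import boto3

events = boto3.client("events")

# Fire once an hour; e.g., "rate(1 day)" would give daily batches instead.
events.put_rule(
    Name="hourly-batch-ingest",
    ScheduleExpression="rate(1 hour)",
)

# Point the rule at a (hypothetical) Lambda function that performs the ingestion.
events.put_targets(
    Rule="hourly-batch-ingest",
    Targets=[{
        "Id": "ingest-lambda",
        "Arn": "arn:aws:lambda:us-east-1:123456789012:function:batch-ingest",
    }],
)
```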
Ingesting data beyond AWS
While AWS includes many services for data ingestion, there are even more services outside of the Amazon ecosystem!
Instead of ingesting data into a data lake, companies can choose to set up a data warehouse and migrate data from various sources directly into the data warehouse.
Just a Few Places Where Ingested Data Can Go
| Data Lakes | Data Warehouses |
| --- | --- |
| Amazon S3 | Snowflake |
| Azure Data Lake | Google BigQuery |
| Google Cloud Storage | Azure Synapse Analytics |
| Databricks | Amazon Redshift |
Databricks, whose platform is built on the open-source Apache Spark data-processing engine, even started using the term “data lakehouse” to describe how its product can be a hybrid of both a data lake and a data warehouse.
The term Extract, Load, Transform (ELT) is another way to describe the migration of data from various sources to a centralized location. ELT tools include Fivetran and Airbyte. These tools can migrate (“extract” and “load”) data to data warehouses such as Snowflake, BigQuery, Azure Synapse Analytics, and Amazon Redshift, as well as to data lakes such as Amazon S3.
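To make the “load first, transform later” idea concrete, here's a toy Python sketch that uses SQLite as a stand-in for a warehouse like Snowflake or Redshift (the records and table names are made up, and json_extract requires a SQLite build with JSON support): raw payloads are loaded untouched, and the transformation happens afterward, inside the destination, with SQL.

```python
import json
import sqlite3

# Extract: pretend these came from a SaaS API or application database.
raw_records = [
    {"user_id": "u-1", "amount": "19.99", "currency": "USD"},
    {"user_id": "u-2", "amount": "5.00", "currency": "USD"},
]

conn = sqlite3.connect(":memory:")

# Load: land the records in the destination exactly as they arrived.
conn.execute("CREATE TABLE raw_orders (payload TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?)",
    [(json.dumps(r),) for r in raw_records],
)

# Transform: derive a clean, typed table from the raw payloads using SQL.
conn.execute("""
    CREATE TABLE orders AS
    SELECT json_extract(payload, '$.user_id') AS user_id,
           CAST(json_extract(payload, '$.amount') AS REAL) AS amount
    FROM raw_orders
""")
print(conn.execute("SELECT * FROM orders").fetchall())
```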
To illustrate the wide variety of data that can be ingested, here are just a few of the hundreds of data sources supported by Fivetran.
Just a Few of the Data Sources Fivetran Can Ingest Data From
| Databases | Marketing and Sales | Product and Engineering | Support and Operations |
| --- | --- | --- | --- |
| Oracle | Instagram Business | GitHub | Zendesk |
| MySQL on AWS RDS | TikTok Ads | SurveyMonkey | Stripe |
| MongoDB | Salesforce | Google Sheets | Dropbox |
| Amazon Aurora PostgreSQL | Google Analytics | FTP and SFTP | Greenhouse |
| Amazon DynamoDB | Shopify | Azure Cloud Functions | Workday HCM |
Fivetran offers data ingestion services at different price points depending on these factors:
Users: From 1 to 10 to unlimited users of the tool.
Usage level: From up to 500,000 monthly active rows migrated to an unlimited number of rows.
Frequency of ingestion: From syncing every hour, to every 15 minutes, to every 5 minutes.
While the terminology around data ingestion might change, the core concept stays the same: migrating data from various sources to a centralized location for further processing and analysis, within the required time frames.