Loading and Generating a Dataset
Learn how to load an existing dataset and process the data to generate a dataset in a TF-consumable format.
Data pipeline
A series of data processing elements forms a data pipeline. ML and DL algorithms commonly rely on data pipelines because they need large amounts of data to build a reasonable model. A pipeline can generate a dataset, load it into memory, and perform data cleaning and transformation. It can also divide a large dataset into batches that the TF framework can process.
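As an illustration, here is a minimal tf.data sketch, using hypothetical in-memory arrays, that shuffles the data and splits it into batches a model can consume:

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory data standing in for a real dataset.
features = np.random.rand(1000, 4).astype("float32")
labels = np.random.randint(0, 2, size=(1000,))

# Build the pipeline: load into a dataset, shuffle, and batch.
dataset = (
    tf.data.Dataset.from_tensor_slices((features, labels))
    .shuffle(buffer_size=1000)  # transformation steps (cleaning, mapping) go here
    .batch(32)                  # divide the data into manageable batches
)

# Inspect one batch to confirm the shapes the model will receive.
for batch_features, batch_labels in dataset.take(1):
    print(batch_features.shape, batch_labels.shape)  # (32, 4) (32,)
```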
A data pipeline can consume data from various data sources (a brief sketch for each appears after the list), such as:
NumPy arrays
Comma-separated values (CSV) files
Text data
Images
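The snippet below is a minimal sketch of how each of these source types can feed a tf.data pipeline. The file and directory names are hypothetical placeholders:

```python
import numpy as np
import tensorflow as tf

# 1. NumPy arrays held in memory.
array_ds = tf.data.Dataset.from_tensor_slices(np.arange(10, dtype="float32"))

# 2. CSV files (file name and label column are hypothetical).
csv_ds = tf.data.experimental.make_csv_dataset(
    "data/train.csv", batch_size=32, label_name="target", num_epochs=1
)

# 3. Text data, one example per line (file name is hypothetical).
text_ds = tf.data.TextLineDataset("data/reviews.txt")

# 4. Images: list files, then read and decode each one (directory is hypothetical).
image_ds = (
    tf.data.Dataset.list_files("data/images/*.jpg")
    .map(lambda path: tf.image.decode_jpeg(tf.io.read_file(path)))
)
```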
Data ingestion, the process of importing data files from multiple sources into a single storage location, might be needed before analyzing and preparing the data for the TF framework. Once we have our data, we follow a three-phase process (sketched in code after the list) to prepare the input data in a format consumable by TF models:
Extract data from various sources (main memory, local disk, and cloud).
Transform (clean, shuffle, etc.) data.
Load data in an output container.
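These three phases map naturally onto the tf.data API. Below is a minimal ETL sketch, assuming a hypothetical TFRecord file that stores JPEG-encoded images and integer labels:

```python
import tensorflow as tf

# Extract: read raw serialized records from disk (file name is hypothetical).
raw_ds = tf.data.TFRecordDataset("data/train.tfrecord")

# Hypothetical schema for each serialized example.
feature_spec = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_and_transform(serialized_example):
    # Transform: parse, decode, and normalize each record.
    example = tf.io.parse_single_example(serialized_example, feature_spec)
    image = tf.io.decode_jpeg(example["image"])
    image = tf.cast(image, tf.float32) / 255.0
    return image, example["label"]

# Load: shuffle, batch, and prefetch so batches are ready when the model needs them.
dataset = (
    raw_ds.map(parse_and_transform, num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(buffer_size=1024)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
```

The resulting `dataset` object can be passed directly to a model's training loop, for example as the input to `model.fit`.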
This is the basic process of extract, transform, and load (ETL) that constitutes a data pipeline. The following figure shows an example of a data pipeline between datasets and a model.