Loading and Generating a Dataset

Learn to load an existing dataset and process the data to generate a dataset in a format that the TF framework can consume.

Data pipeline

A series of data processing elements forms a data pipeline. ML and DL algorithms commonly use data pipelines because they need a huge amount of data to build a reasonable model. A pipeline can generate a dataset, load the dataset into memory, and perform data cleaning and transformation. Furthermore, it divides a large dataset into batches that the TF framework can manage.
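
To make the batching idea concrete, here is a minimal sketch using the tf.data API; the sample count and batch size are arbitrary values chosen for illustration, not part of the lesson.

```python
import tensorflow as tf

# 1,000 synthetic samples stand in for a large dataset.
dataset = tf.data.Dataset.range(1000)

# Divide the dataset into batches of 32 elements each,
# a size the TF framework can process one step at a time.
batched = dataset.batch(32)

for batch in batched.take(2):
    print(batch.shape)  # (32,) for each of the first two batches
```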

A data pipeline can consume data from various data sources (see the sketch after this list), such as:

  • NumPy arrays

  • Comma-separated values (CSV) files

  • Text data

  • Images

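Each of these sources maps onto a tf.data entry point. The following is a hedged sketch; the file names ("data.csv", "notes.txt") and the "images/" directory are hypothetical placeholders, which is why those lines are left commented out.

```python
import numpy as np
import tensorflow as tf

# NumPy arrays: slice an in-memory array into individual examples.
features = np.random.rand(100, 4).astype("float32")
labels = np.random.randint(0, 2, size=100)
numpy_ds = tf.data.Dataset.from_tensor_slices((features, labels))

# CSV files: stream rows and parse them into columns (hypothetical file).
# csv_ds = tf.data.experimental.make_csv_dataset(
#     "data.csv", batch_size=32, label_name="target")

# Text data: one dataset element per line of the file (hypothetical file).
# text_ds = tf.data.TextLineDataset("notes.txt")

# Images: build a dataset from a directory of labeled image folders
# (hypothetical directory).
# image_ds = tf.keras.utils.image_dataset_from_directory("images/")
```
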
Data ingestion, the process of importing data files from multiple sources into a single storage location, might be needed before analyzing and preparing the data for the TF framework. Once we have our data, we employ a three-phase process to prepare the input data in a format consumable by TF models:

  • Extract data from various sources (main memory, local disk, and cloud).

  • Transform (clean, shuffle, etc.) data.

  • Load data into an output container.

This is the basic extract, transform, and load (ETL) process that constitutes a data pipeline. The following figure shows an example of a data pipeline between datasets and a model.
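
To see how the three phases map onto TF code, here is a minimal ETL sketch built on the tf.data API; the in-memory NumPy source, feature shapes, and batch size are illustrative assumptions rather than part of the lesson.

```python
import numpy as np
import tensorflow as tf

# Extract: pull raw data from a source (here, main memory).
features = np.random.rand(500, 8).astype("float32")
labels = np.random.randint(0, 2, size=500)
dataset = tf.data.Dataset.from_tensor_slices((features, labels))

# Transform: rescale each example and shuffle the dataset.
dataset = dataset.map(lambda x, y: (x / tf.reduce_max(x), y))
dataset = dataset.shuffle(buffer_size=500)

# Load: batch and prefetch so a model can consume the data efficiently,
# e.g., by passing the dataset directly to model.fit(dataset).
dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
```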
