Building a Data Pipeline Using the tf.data API
Learn to build a data pipeline using the tf.data API in TensorFlow.
Building data pipelines using tf.data
tf.data provides a convenient way to build data pipelines in TensorFlow. Input pipelines are designed for heavy-duty programs that need to process a lot of data. If we have a small dataset that fits in memory (for example, the MNIST dataset), an input pipeline would be excessive. However, when working with complex problems and large datasets that don't fit in memory, we often need to augment the data (for example, adjust image contrast/brightness), transform it numerically (for example, standardize it), and so on. The tf.data API provides convenient functions to load and transform our data easily. Furthermore, it streamlines our data ingestion code with the model training.
Additionally, the tf.data API offers various options to enhance the performance of our data pipeline, such as parallel data processing and prefetching. Prefetching refers to bringing data into memory before it's required and keeping it ready.
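Below is a minimal sketch of how these performance options can be applied. It assumes an in-memory NumPy array of images and labels as a stand-in for a real data source; the array shapes and the augmentation function are illustrative choices, not part of any particular dataset.

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory images and labels standing in for a real data source.
images = np.random.uniform(size=(1024, 64, 64, 3)).astype("float32")
labels = np.random.randint(0, 10, size=(1024,))

def augment(image, label):
    # Example augmentations: random brightness/contrast adjustments.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image, label

ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    # Run the augmentation on multiple elements in parallel.
    .map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    .batch(32)
    # Prepare the next batch while the current one is being consumed.
    .prefetch(tf.data.AUTOTUNE)
)
```

Passing `tf.data.AUTOTUNE` lets TensorFlow pick the degree of parallelism and the prefetch buffer size dynamically, rather than hard-coding those values.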
Creating a data pipeline
When creating an input pipeline, we typically perform the following steps (a short sketch follows the list):
- Source the data from a data source (for example, an in-memory NumPy array, CSV file on disk, or individual files, such as images).
- Apply various transformations to the data (for example, cropping/resizing image data).
- Iterate over the resulting dataset element- or batch-wise. Batching is required because deep learning models are trained on randomly sampled batches of data, and the datasets they are trained on are typically too large to fit in memory.
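The sketch below walks through these three steps on synthetic data. The random arrays and the per-example standardization function are assumptions made for illustration; in practice the source could be CSV files, image files on disk, or TFRecords.

```python
import numpy as np
import tensorflow as tf

# Hypothetical source data: random arrays standing in for real features/targets.
features = np.random.uniform(size=(1000, 28, 28)).astype("float32")
targets = np.random.randint(0, 10, size=(1000,))

# 1. Source the data.
dataset = tf.data.Dataset.from_tensor_slices((features, targets))

# 2. Apply transformations (here, a simple per-example standardization).
def standardize(x, y):
    x = (x - tf.reduce_mean(x)) / (tf.math.reduce_std(x) + 1e-7)
    return x, y

dataset = dataset.map(standardize)

# 3. Shuffle, batch, and iterate batch-wise.
dataset = dataset.shuffle(buffer_size=1000).batch(64)

for batch_x, batch_y in dataset.take(2):
    print(batch_x.shape, batch_y.shape)
```

The same `dataset` object can be passed directly to `model.fit()`, which is what is meant by streamlining data ingestion with model training.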