Introduction to Cloud Dataflow and Batch Modeling
Explore how to create scalable batch processing pipelines with Cloud Dataflow, focusing on Python implementations. Understand key components such as pipelines, DoFns, and transforms, and learn to apply machine learning models to data within a managed cloud environment.
What is Dataflow?
Dataflow is a tool for building data pipelines that can run locally or scale up to large clusters in a managed environment. While Cloud Dataflow was initially incubated at Google as a GCP-specific tool, it now builds upon the open-source Apache Beam library, making it usable in other cloud environments.
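To ground these ideas, here is a minimal sketch of a Beam pipeline in Python, assuming the apache-beam package is installed. With no runner specified, it executes locally on the DirectRunner; passing `--runner=DataflowRunner` along with GCP project and staging options runs the same code as a managed Dataflow job. The `FormatGreeting` class and step names are illustrative, not part of any chapter code:

```python
import apache_beam as beam


class FormatGreeting(beam.DoFn):
    """A DoFn holds the per-element logic that a ParDo transform applies."""

    def process(self, element):
        yield f"Hello, {element}!"


# With no runner option, the pipeline runs locally on the DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "CreateNames" >> beam.Create(["Beam", "Dataflow"])
        | "Greet" >> beam.ParDo(FormatGreeting())
        | "Print" >> beam.Map(print)
    )
```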
The tool provides the following, combined in the sketch after this list:
- input connectors to different data sources, such as BigQuery and files on Cloud Storage
- operators for transforming and aggregating data
- output connectors to systems such as Cloud Datastore and BigQuery
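A minimal sketch wiring these three pieces together, assuming the apache-beam[gcp] extra is installed. The table names, schema, and `AddScore` DoFn are hypothetical placeholders; `AddScore` stands in for real model-scoring logic:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical identifiers for illustration; substitute your own
# project, dataset, and table names.
SOURCE_TABLE = "my-project:my_dataset.input_table"
SINK_TABLE = "my-project:my_dataset.output_table"


class AddScore(beam.DoFn):
    """Placeholder operator; a real pipeline would apply a model here."""

    def process(self, element):
        row = dict(element)       # avoid mutating the input element
        row["score"] = 0.5        # stand-in for a model prediction
        yield row


with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        # Input connector: read BigQuery rows as Python dictionaries.
        | "Read" >> beam.io.ReadFromBigQuery(table=SOURCE_TABLE)
        # Operator: transform each row with a DoFn.
        | "Score" >> beam.ParDo(AddScore())
        # Output connector: write the scored rows back to BigQuery.
        | "Write" >> beam.io.WriteToBigQuery(
            SINK_TABLE,
            schema="user_id:STRING,score:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
        )
    )
```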
In this chapter, we’ll build a pipeline with Dataflow that reads in data from BigQuery, applies an sklearn model to create predictions, and writes the ...