Introduction to Cloud Dataflow and Batch Modeling
Introduction to batch model workflows.
What is Dataflow?
Dataflow is a tool for building data pipelines that can run locally or scale up to large clusters in a managed environment. While Cloud Dataflow was initially incubated at Google as a GCP-specific tool, it now builds on the open-source Apache Beam library, making it usable in other cloud environments.
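To make the local-versus-managed distinction concrete, here is a minimal sketch of a Beam pipeline in Python. It assumes the apache-beam package is installed; the DirectRunner executes the pipeline locally, and switching the runner option to DataflowRunner (with the appropriate GCP project settings) submits the same code to a managed cluster.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# DirectRunner executes the pipeline locally; DataflowRunner would
# submit the same pipeline to a managed cluster on GCP.
options = PipelineOptions(runner="DirectRunner")

with beam.Pipeline(options=options) as pipeline:
    (
        pipeline
        | "Create" >> beam.Create([1, 2, 3, 4])   # a toy in-memory source
        | "Square" >> beam.Map(lambda x: x * x)   # a simple transform
        | "Print" >> beam.Map(print)              # inspect the output
    )
```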
The tool provides three kinds of building blocks, combined in the sketch after this list:
- input connectors to different data sources, such as BigQuery and files on Cloud Storage
- operators for transforming and aggregating data
- output connectors to systems such as Cloud Datastore and BigQuery
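As a sketch of how these three pieces compose, the snippet below chains an input connector, a transform operator, and an output connector. The project, dataset, table names, column names, and schema are placeholders rather than values from this chapter.

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

with beam.Pipeline(options=PipelineOptions()) as pipeline:
    (
        pipeline
        # Input connector: read rows from BigQuery as Python dicts.
        | "Read" >> beam.io.ReadFromBigQuery(
            query="SELECT user_id, value FROM `project.dataset.table`",
            use_standard_sql=True)
        # Operator: transform each row.
        | "Transform" >> beam.Map(
            lambda row: {"user_id": row["user_id"],
                         "value_doubled": 2 * row["value"]})
        # Output connector: write the transformed rows back to BigQuery.
        | "Write" >> beam.io.WriteToBigQuery(
            "project:dataset.output_table",
            schema="user_id:STRING, value_doubled:FLOAT")
    )
```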
In this chapter, we’ll build a pipeline with Dataflow that reads data from BigQuery, applies an sklearn model to generate predictions, and writes the predictions to BigQuery and Cloud Datastore. We’ll start by running the pipeline locally on a subset of data and then scale up to a larger dataset using GCP.
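We’ll build that pipeline step by step, but as a preview of the modeling stage only, here is a hedged sketch of a DoFn that loads a pickled sklearn model once per worker and attaches a prediction to each row. The model path and feature names are hypothetical placeholders, not values from this chapter.

```python
import pickle

import apache_beam as beam

class ApplyModel(beam.DoFn):
    """Applies a pickled sklearn model to each element of a PCollection."""

    def setup(self):
        # setup() runs once per worker, so the model is loaded once
        # rather than once per element. The path is hypothetical.
        with open("model.pkl", "rb") as model_file:
            self.model = pickle.load(model_file)

    def process(self, row):
        # The feature names are hypothetical placeholders.
        features = [[row["x1"], row["x2"]]]
        row["prediction"] = float(self.model.predict(features)[0])
        yield row

# Used inside a pipeline as: ... | "Predict" >> beam.ParDo(ApplyModel())
```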