Introduction to Cloud Dataflow and Batch Modeling

What is Dataflow?

Dataflow is a tool for building data pipelines that can run locally or scale up to large clusters in a managed environment. While Cloud Dataflow was initially incubated at Google as a GCP-specific tool, it now builds upon the open-source Apache Beam library, so pipelines written for it can also be run in other cloud environments.

The tool provides:

  • input connectors to different data sources, such as BigQuery and files on Cloud Storage
  • operators for transforming and aggregating data
  • output connectors to systems such as Cloud Datastore and BigQuery

In this chapter, we’ll build a pipeline with Dataflow that reads data from BigQuery, applies a scikit-learn (sklearn) model to create predictions, and writes the predictions to BigQuery and Cloud Datastore. We’ll start by running the pipeline locally on a subset of data and then scale up to a larger dataset using GCP.
