What is Apache Beam?

Apache Beam is an open-source library for building data processing workflows using Java, Python, and Go.

Beam workflows can be executed on several execution engines, called runners, including Spark, Flink, and Dataflow. With Beam, you can test a workflow locally using the Direct Runner and then deploy the same workflow to GCP using the Dataflow Runner. Beam pipelines can be batch, where a workflow runs until it completes, or streaming, where the pipeline runs continuously and operations are performed in near real-time as data is received. We’ll focus on batch pipelines in this chapter and cover streaming pipelines in the next chapter.

In our pre-configured execution environment, the Apache Beam library is already installed, so you can skip the local installation instructions and move straight to the word count example.

Installing libraries

The first step is to install the Apache Beam library.

From the command line, install the library, set up credentials for GCP, and run a test pipeline locally. The pip command includes the gcp extra (apache-beam[gcp]) to specify that the Dataflow modules should also be installed. If the last step is successful, the pipeline will output the word counts for Shakespeare’s King Lear.
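
The three steps described above might look like the following sketch. The apache-beam[gcp] package, the GOOGLE_APPLICATION_CREDENTIALS variable, and the bundled apache_beam.examples.wordcount module are standard, but the credentials path and the outputs prefix are placeholders assumed here for illustration. Substitute the path to your own service account key file.

```shell
# Install Apache Beam with the gcp extra, which pulls in the Dataflow modules.
# The quotes keep the shell from interpreting the square brackets.
python3 -m pip install 'apache-beam[gcp]'

# Point the Google client libraries at a service account key.
# Placeholder path (an assumption): replace it with your own JSON key file.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/credentials.json"

# Run the bundled word count example locally.
# With no --runner flag, Beam falls back to the Direct Runner; with no
# --input flag, the example reads its default King Lear text.
python3 -m apache_beam.examples.wordcount --output outputs
```

If the run succeeds, the word counts should appear in one or more files whose names start with the outputs prefix.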
