Apache Beam
An introduction to Apache Beam and a worked word count example.
What is Apache Beam?
Apache Beam is an open-source library for building data processing workflows using Java, Python, and Go.
Beam workflows can be executed on several execution engines, including Spark, Flink, and Google Cloud Dataflow. With Beam, you can test a workflow locally using the Direct Runner and then deploy the same workflow to GCP using the Dataflow Runner. Beam pipelines can be batch, where the workflow runs until its bounded input is fully processed, or streaming, where the pipeline runs continuously and operations are performed in near real-time as data arrives. We’ll focus on batch pipelines in this chapter and cover streaming pipelines in the next chapter.
In our pre-configured execution environment, the Apache Beam library is already installed, so you can skip the local installation instructions and move straight to the word count example.
Installing libraries
The first thing we’ll need to do to get up and running is install the Apache Beam library. Run the commands below from the command line to install the library, set up credentials for GCP, and run a test pipeline locally. The pip command includes the gcp extra to specify that the Dataflow modules should also be installed. If the last step is successful, the pipeline will output the word counts for Shakespeare’s King Lear.
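A sketch of what these commands might look like; the credentials path is a placeholder you would replace with the path to your own GCP service-account key:

```shell
# Install Beam with the gcp extra so the Dataflow modules are included
pip install --user 'apache-beam[gcp]'

# Point the GCP client libraries at a service-account key (placeholder path)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

# Run Beam's built-in word count example locally with the Direct Runner;
# its default input is Shakespeare's King Lear
python -m apache_beam.examples.wordcount --output outputs
```

The final step writes the word counts to files with the `outputs` prefix in the current directory.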