Apache Beam
An introduction to Apache Beam and a worked word count example.
What is Apache Beam?
Apache Beam is an open-source library for building data processing workflows using Java, Python, and Go.
Beam workflows can be executed on several execution engines, including Spark, Flink, and Google Cloud Dataflow. With Beam, you can test a workflow locally using the Direct Runner and then deploy the same workflow to GCP using the Dataflow Runner. Beam pipelines can be batch, where the workflow runs until its bounded input is fully processed, or streaming, where the pipeline runs continuously and operations are performed in near real-time as data arrives. We’ll focus on batch pipelines in this chapter and cover streaming pipelines in the next chapter.
In our pre-configured execution environment, the Apache Beam library is already installed, so you can skip the local installation instructions and move straight to the word count example.
Installing libraries
The first thing we’ll need to do to get up and running is install the Apache Beam library. Run the commands below from the command line to install the library, set up credentials for GCP, and run a test pipeline locally. The pip command includes the gcp extra to specify that the Dataflow modules should also be installed. If the last step is successful, the pipeline will output the word counts for Shakespeare’s King Lear.
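A sketch of what these commands might look like; the credentials path is a placeholder you would replace with the path to your own GCP service-account key:

```shell
# Install Beam with the gcp extra so the Dataflow modules are included
pip install --user 'apache-beam[gcp]'

# Point the GCP client libraries at a service-account key (placeholder path)
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

# Run Beam's built-in word count example locally with the Direct Runner;
# its default input is Shakespeare's King Lear
python -m apache_beam.examples.wordcount --output outputs
```

The final step writes the word counts to files with the `outputs` prefix in the current directory.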